
CN107204868A - Method and device for acquiring task operation monitoring information - Google Patents


Info

Publication number
CN107204868A
Authority
CN
China
Prior art keywords
task
monitoring information
platform
information
type
Legal status
Granted
Application number
CN201610158804.3A
Other languages
Chinese (zh)
Other versions
CN107204868B (en)
Inventor
卢山
Current Assignee
China Mobile Group Shanxi Co Ltd
Original Assignee
China Mobile Group Shanxi Co Ltd
Application filed by China Mobile Group Shanxi Co Ltd
Priority to CN201610158804.3A
Publication of CN107204868A
Application granted
Publication of CN107204868B
Legal status: Active


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method for acquiring task operation monitoring information. A task identifier is set for each task node at which a task runs, according to the operation information of that node. The method further includes: according to the task identifier of the failure point, obtaining from the monitoring information the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the task node corresponding to the content of that identifier, and determining the failure information. The invention also discloses a device for acquiring task operation monitoring information.

Description

Method and device for acquiring task operation monitoring information

Technical Field

The invention relates to big data operation management technology, and in particular to a method and device for acquiring task operation monitoring information.

Background

At present, big data applications are developing rapidly and are widely used in many fields. Each technology platform has its own strengths for running tasks, but each is a self-contained system that operates independently. A big data platform must integrate the strengths of different technology stacks to meet business needs at different levels. As a result, an operating model that is heterogeneous, mixes technologies, uses converged deployment, and links multiple services has become what distinguishes big data platforms from traditional platforms. Under these conditions, troubleshooting a failure generally proceeds as follows:

1. Fault location. Big data platform tasks are scattered across many platforms. The monitoring-information management systems of these platforms are independent of each other and follow their own standards and rules. When a fault occurs, operation and maintenance (O&M) staff must therefore:
   - manually cross-check the management console of every platform;
   - compare the fault information each platform reports;
   - manually eliminate secondary and correlated alarms;
   - confirm the root-cause alarm and the cause of the fault.

2. Fault analysis. Because each system's data is independent, most fault analysis today is performed within a single system, and data from the various systems is aggregated and correlated by hand. When a platform runs many tasks with complex logical relationships, the cross-dependencies are practically impossible to judge. Analysis can then only proceed by gradually drilling down from the system as a whole. This process is slow, and the severity and scope of the fault's impact cannot be confirmed promptly.

3. Fault resolution. After information from each vendor has been collected and the fault point determined manually, the vendors must be coordinated to resolve it together. Each vendor is responsible only for faults in its own system and does not consider how its fix affects other systems, so no one manages the fault as a whole at the system-architecture level. After the fix, the result must be confirmed separately on each system's management console, and staff must judge manually whether the fault is resolved and whether new problems have arisen.

Under existing conditions, big data platforms have the following shortcomings in assessing the business impact of faults, in fault analysis efficiency, and in alarm accuracy.

- **Siloed fault location.** Each vendor currently locates faults separately. On a platform that mixes many technologies, more and more tasks go online over time. The logical relationships between them cannot truly be understood, and their runtime dependencies are hard to untangle. Comprehensive fault analysis is therefore lacking, and the business impact of a fault is hard to assess accurately.
- **Unclear cross-component correlations.** Vendors' operation and management capabilities for big data technologies vary widely. When many components such as Spark, Storm, Sqoop, Hive, and HBase are used together, the fault correlations between components cannot be untangled. During fault location each component must be inspected from scratch, which prolongs resolution.
- **Alarm storms.** Existing fault monitoring is siloed: task faults, platform faults, and device faults each have their own interfaces and information, with no correlation or fusion between them. When a fault occurs, large numbers of alarms appear at different levels and cannot be effectively correlated and filtered. The result is an alarm storm that leaves O&M staff at a loss.

The alarms from each platform must be aggregated and analyzed by hand, secondary and correlated alarms eliminated, and the root cause found. This relies heavily on expert staff, and fault handling is inefficient.

It can be seen that three problems urgently need to be solved:
- locating big data platform faults faster;
- automating the analysis of fault impact;
- making fault resolution more timely.

Summary of the Invention

In view of this, embodiments of the present invention aim to provide a method and device for acquiring task operation monitoring information. The method and device locate big data platform faults quickly, analyze fault impact automatically, and make fault resolution more timely.

To achieve the above objective, the technical solution of the present invention is implemented as follows:

An embodiment of the present invention provides a method for acquiring task operation monitoring information. The method includes setting a task identifier for each task node at which a task runs, according to the operation information of that node. The method further includes:

according to the task identifier of the failure point, obtaining from the monitoring information the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the task node corresponding to the content of that identifier, and determining the failure information.

In the above solution, the operation information includes the operating platform type and/or the operating platform component type;

the content of the task identifier includes the operating platform type, and/or the operating platform component type, and/or a task sequence number;

the task sequence number is either a unique identification number preset for the task or a unique identification number preset for the task node.
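The identifier content above can be modelled, purely as an illustration, as a small record. Every field is optional, mirroring the "and/or" wording. The sequence number is either task-level or node-level. The class and function names and the `-` separator for node-level numbers are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskIdentifier:
    """Hypothetical model of the task identifier content; each field is optional ("and/or")."""
    platform_type: Optional[str] = None    # operating platform type, e.g. "Spark"
    component_type: Optional[str] = None   # operating platform component type
    sequence_number: Optional[str] = None  # unique number preset per task or per task node


def task_sequence(task_no: str) -> str:
    """Option 1: one unique number preset for the whole task."""
    return task_no


def node_sequence(task_no: str, node_name: str) -> str:
    """Option 2: one unique number per task node (the "-" suffix scheme is an assumption)."""
    return f"{task_no}-{node_name}"


if __name__ == "__main__":
    print(TaskIdentifier("Spark", "executor", node_sequence("T0001", "n2")))
```

Making the record frozen lets it be used as a dictionary key when monitoring data is later indexed by identifier.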

In the above solution, obtaining the task monitoring information of the corresponding task node according to the failure-point task identifier, and determining the failure information, includes:

associating the task sequence number with the task monitoring information in advance;

determining, according to the task sequence number in the failure-point task identifier, the task monitoring information of the task node corresponding to that identifier.

In the above solution, obtaining the platform monitoring information of the corresponding task node according to the failure-point task identifier, and determining the failure information, includes:

determining, according to the operating platform type and operating platform component type in the failure-point task identifier, the platform type and platform component type of the task node corresponding to that identifier;

determining, according to that platform type and platform component type, the corresponding platform monitoring information.

In the above solution, obtaining the device monitoring information of the corresponding task node according to the failure-point task identifier, and determining the failure information, includes:

retrieving from the task execution log the host name of the device on which the operating platform type and operating platform component type in the failure-point task identifier run;

determining, according to the device host name, the device monitoring information of the task node corresponding to the failure-point task identifier.

An embodiment of the present invention further provides a device for acquiring task operation monitoring information. The device includes a setting unit and a determining unit, wherein:

the setting unit is configured to set a task identifier for each task node at which a task runs, according to the operation information of that node;

the determining unit is configured to obtain from the monitoring information, according to the failure-point task identifier, the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the task node corresponding to the content of that identifier, and to determine the failure information.

In the above solution, the operation information includes the operating platform type and/or the operating platform component type;

the content of the task identifier includes the operating platform type, and/or the operating platform component type, and/or a task sequence number;

the task sequence number is either a unique identification number preset for the task or a unique identification number preset for the task node.

In the above solution, the determining unit is specifically configured to:

associate the task sequence number with the task monitoring information in advance;

where determining the failure information includes determining, according to the task sequence number in the failure-point task identifier, the task monitoring information of the task node corresponding to that identifier.

In the above solution, the determining unit is specifically configured to:

determine, according to the operating platform type and operating platform component type in the failure-point task identifier, the platform type and platform component type of the task node corresponding to that identifier;

determine, according to that platform type and platform component type, the corresponding platform monitoring information.

In the above solution, the determining unit is specifically configured to:

retrieve from the task execution log the host name of the device on which the operating platform type and operating platform component type in the failure-point task identifier run;

determine, according to the device host name, the device monitoring information of the task node corresponding to the failure-point task identifier.

The method and device provided by the embodiments of the present invention work in two steps:
- a task identifier is set for each task node at which a task runs, according to the operation information of that node;
- according to the failure-point task identifier, the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the corresponding task node is obtained, and the failure information is determined.

As a result, the task, platform, and device monitoring information of the failure-point task node can be obtained accurately through its task identifier. When a fault occurs, the fault monitoring information can be retrieved from the identifier of the faulty node. Faults are therefore located quickly, their impact is analyzed automatically, and resolution becomes more timely.

Description of Drawings

FIG. 1 is a schematic flowchart of a method for acquiring task operation monitoring information according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the composition of a task identifier according to an embodiment of the present invention;

FIG. 3 is a schematic sequence diagram of the fault-location flow implemented through task identifiers according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the task identifier association principle according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the business process of an application example according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the business process run records of the application example according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the task identifier run results of the application example according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a device for acquiring task operation monitoring information according to an embodiment of the present invention.

Detailed Description

In the embodiments of the present invention, a task identifier is set for each task node at which a task runs, according to the operation information of that node. Then, according to the failure-point task identifier, the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the corresponding task node is obtained from the monitoring information, and the failure information is determined.

The present invention is described in further detail below with reference to embodiments.

As shown in FIG. 1, a method for acquiring task operation monitoring information provided by an embodiment of the present invention includes:

Step 101: set a task identifier for each task node at which the task runs, according to the operation information of that node.

Typically, a big data application contains multiple tasks. The relationship between an application and its tasks is one-to-many and is maintained by the system. A single task is also called a job flow, and a job flow consists of multiple task nodes. The prior-art approach assigns each task a task identifier or task sequence number that is unrelated to its operation information, and uses it to track the task's execution. Its drawback is that the execution status of individual task nodes cannot be obtained precisely, and no operation information can be read directly from the identifier.

In the technical solution of the present invention, a different task identifier is set at each task node during task execution. The identifier contains the task's operation information at that node: the operating platform type, the operating platform component type, and/or the task sequence number.
- The operating platform type is the type of platform on which the node runs, such as Java, Storm, Hadoop, or Spark.
- The operating platform component type is the type of platform component on which the node runs, such as the Java-Node component of the Java platform.
- The task sequence number is the unique sequence number assigned to the task before it runs.
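The paragraph above contrasts one opaque ID per task with per-node identifiers that carry run information. That contrast can be sketched as follows, assuming a job flow is an ordered list of nodes each tagged with its platform and component type. On every run, each node gets its own identifier, and the identifier alone reveals where the node ran. The `Node` type, the colon-separated layout, and the `uuid`-based run number are illustrative assumptions, not the patent's format.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    name: str
    platform_type: str   # platform the node runs on, e.g. "Java", "Storm", "Spark"
    component_type: str  # component of that platform, e.g. "Java-Node"


def assign_node_identifiers(flow: List[Node], run_seq: str) -> Dict[str, str]:
    """Give each node of one run of a job flow its own identifier.

    Hypothetical layout: <platform>:<component>:<run sequence>-<node name>.
    """
    return {
        node.name: f"{node.platform_type}:{node.component_type}:{run_seq}-{node.name}"
        for node in flow
    }


if __name__ == "__main__":
    flow = [Node("extract", "Java", "Java-Node"), Node("aggregate", "Spark", "executor")]
    run_seq = uuid.uuid4().hex[:8]  # unique sequence number allocated before the run
    for name, tid in assign_node_identifiers(flow, run_seq).items():
        print(name, tid)
```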

In practice, the task identifier may take the form shown in FIG. 2. It may additionally include a task type and task name so that tasks can be recognized faster and more easily. Following the form of FIG. 2, the task identifier can be set as follows for different task component types:

- **Synchronous task component.** The component runs directly on the Oozie server and reports only success or failure, with no other distinctive information. Its task identifier can therefore be set to: oozie:none;
- **Single map/reduce task component.** A map-only MapReduce job is submitted to trigger execution, and the user-defined action component is encapsulated in its map. The task identifier can be defined as: oozie:launcher:T={0}:W={1}:A={2}:ID={3}, where {0} denotes the component type, {1} the platform type, {2} the task name, and {3} the task sequence number. The identifier thus reflects the relationship between the task and the platform;
- **Dual map/reduce task component.** This builds on the single map/reduce component, except that the user-defined action encapsulated in the map is itself a MapReduce job. The task identifier of such a job can be defined as: oozie:action:T={0}:W={1}:A={2}:ID={3}, with the same slot meanings ({0} component type, {1} platform type, {2} task name, {3} task sequence number). Again, the identifier reflects the relationship between the task and the platform.

Here, the single map/reduce and dual map/reduce task components are referred to as asynchronous task components.
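The two asynchronous identifier formats above are templates with four positional slots. The sketch below fills and parses them, using the slot meanings stated in this description (component type, platform type, task name, task sequence number). It spells the launcher prefix `oozie:launcher`, which is how Apache Oozie names its launcher jobs. The helper names, the regular expression, and the sample values are hypothetical, and the parser assumes no field value contains a colon.

```python
import re
from typing import Dict, Optional

# Templates with positional slots {0}..{3}.
LAUNCHER_PATTERN = "oozie:launcher:T={0}:W={1}:A={2}:ID={3}"  # single map/reduce component
ACTION_PATTERN = "oozie:action:T={0}:W={1}:A={2}:ID={3}"      # dual map/reduce component
SYNC_ID = "oozie:none"                                         # synchronous component

_ID_RE = re.compile(
    r"^oozie:(?P<kind>launcher|action):T=(?P<component>[^:]*):W=(?P<platform>[^:]*)"
    r":A=(?P<task_name>[^:]*):ID=(?P<seq>.*)$"
)


def build_task_id(template: str, component: str, platform: str, task_name: str, seq: str) -> str:
    """Fill the four slots of an asynchronous-component template."""
    return template.format(component, platform, task_name, seq)


def parse_task_id(task_id: str) -> Optional[Dict[str, str]]:
    """Recover the fields; returns None for oozie:none or any unrecognised format."""
    m = _ID_RE.match(task_id)
    return m.groupdict() if m else None


if __name__ == "__main__":
    tid = build_task_id(LAUNCHER_PATTERN, "java", "Hadoop", "load_job", "0000123")
    print(tid)
    print(parse_task_id(tid))
```

Returning `None` for `oozie:none` reflects the text: a synchronous component's identifier carries no platform details to extract.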

Step 102: according to the failure-point task identifier, obtain from the monitoring information the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the task node corresponding to the content of that identifier, and determine the failure information.

Existing big data monitoring information comes in four kinds: application, task, platform, and device monitoring information. Each captures various aspects of how tasks run. Their contents are as follows:
- Platform monitoring information includes the platform name, platform type, platform status, and the execution status and logs of tasks on the platform.
- Task monitoring information includes task flow information, flow stages, the current stage, the time spent in each stage, node transitions, and task output logs.
- Device monitoring information includes the device platform, device host information, and device host operating status.

Task, platform, and device monitoring information are normally independent of and unrelated to one another. The links are established as follows:
- Because the association between big data applications and tasks is maintained by the system, task monitoring information can be linked to application monitoring information through task ownership.
- Task, platform, and device monitoring information can be associated with one another through the task identifier of the present invention.

Together, these links connect all four kinds of monitoring information. The associated information can then be merged into the application monitoring information, which can provide:
- the tasks an application contains;
- the execution of each task;
- how each task runs on its platform;
- the operating status of the host each task runs on.

In this way, information on how every task of the whole application runs at each task node can be obtained.
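The association just described can be sketched as a merge, assuming each monitoring source is a plain dictionary:
- task records keyed by task identifier;
- platform records keyed by (platform type, component type);
- device records keyed by host name;
- an identifier-to-host map, assumed to have been derived from task execution logs.

All structures and key names are assumptions for illustration, not a format defined by the patent.

```python
from typing import Dict, List, Tuple

PlatformKey = Tuple[str, str]  # (platform type, platform component type)


def build_application_view(
    app_tasks: Dict[str, List[str]],            # application -> task identifiers (one-to-many)
    task_info: Dict[str, dict],                 # task identifier -> task monitoring info
    id_fields: Dict[str, PlatformKey],          # task identifier -> fields read from the identifier
    platform_info: Dict[PlatformKey, dict],     # (platform, component) -> platform monitoring info
    host_of: Dict[str, str],                    # task identifier -> host name (from execution logs)
    device_info: Dict[str, dict],               # host name -> device monitoring info
) -> Dict[str, List[dict]]:
    """Merge task, platform and device monitoring info into application monitoring info."""
    view: Dict[str, List[dict]] = {}
    for app, task_ids in app_tasks.items():
        rows = []
        for tid in task_ids:
            host = host_of.get(tid)
            rows.append({
                "task_id": tid,
                "task": task_info.get(tid),
                "platform": platform_info.get(id_fields.get(tid)),
                "host": host,
                "device": device_info.get(host) if host else None,
            })
        view[app] = rows
    return view
```

Missing links simply surface as `None`, so a partially associated task still appears in the application view.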

Specifically, before a task runs it can be assigned a unique task sequence number, and the task monitoring information can be mapped to that number. There are two ways to do this:
- Name the task monitoring information after the sequence number. The corresponding task monitoring information can then be retrieved through the sequence number in the task identifier.
- Append a node-specific suffix to the sequence number and keep task monitoring information separately under each node's sequence number. The sequence number in the task identifier then retrieves the monitoring information of the corresponding task node.

If a fault occurs while a big data task is running, the task monitoring information of the faulty task node can be obtained through that node's task identifier.
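A minimal sketch of both storage options above, assuming monitoring events are plain strings. Records are filed under the task sequence number, or under the sequence number plus a node suffix when they are kept per node. The class name and the `-` separator are hypothetical.

```python
from typing import Dict, List, Optional


class TaskMonitorStore:
    """Hypothetical store filing task monitoring info under the task sequence number,
    or under "<sequence number>-<node suffix>" when it is kept per task node."""

    def __init__(self) -> None:
        self._records: Dict[str, List[str]] = {}

    @staticmethod
    def _key(seq: str, node: Optional[str]) -> str:
        # The node-specific suffix is appended to the sequence number (assumed "-" separator).
        return seq if node is None else f"{seq}-{node}"

    def record(self, seq: str, event: str, node: Optional[str] = None) -> None:
        """Associate a monitoring event with the sequence number while the task runs."""
        self._records.setdefault(self._key(seq, node), []).append(event)

    def lookup(self, seq: str, node: Optional[str] = None) -> List[str]:
        """Fetch the monitoring info for a (possibly faulty) task or task node."""
        return list(self._records.get(self._key(seq, node), []))
```

Keying by the sequence number means the lookup needs nothing beyond what the failure-point identifier already carries.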

The operating platform type and operating platform component type in the task identifier show the platform type and platform component type of the task node on which the task is currently running. Platform monitoring information is usually classified by platform type and platform component type. The platform monitoring information of the task node can therefore be looked up using these two values. If a fault occurs while a big data task is running, the platform monitoring information of the faulty task node can be found through the platform type and platform component type in its task identifier.
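The lookup above can be sketched against an Oozie-style identifier. Following this description's slot meanings, the W= field carries the platform type and T= the component type. Platform monitoring entries are assumed, for illustration, to be lists indexed by that (platform, component) pair. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

PlatformKey = Tuple[str, str]  # (platform type, platform component type)


def platform_key_from_task_id(task_id: str) -> Optional[PlatformKey]:
    """Read platform type (W=) and component type (T=) from an oozie-style task identifier."""
    fields = dict(part.split("=", 1) for part in task_id.split(":") if "=" in part)
    if "W" not in fields or "T" not in fields:
        return None  # e.g. "oozie:none" carries no platform information
    return fields["W"], fields["T"]


def platform_monitoring_for(task_id: str, platform_info: Dict[PlatformKey, List[str]]) -> List[str]:
    """Return the platform monitoring entries classified under the node's platform and component."""
    key = platform_key_from_task_id(task_id)
    return platform_info.get(key, []) if key else []
```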

The device lookup takes two steps:
- Using the operating platform type and operating platform component type in the task identifier, retrieve from the task execution log (part of the platform monitoring information) the host name of the device running that platform and component type.
- Using that host name, retrieve the task node's device monitoring information.

If a fault occurs while a big data task is running, the same two steps apply to the faulty task node, starting from the platform type and platform component type in its task identifier.
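The two-step lookup above (log search for the host, then device monitoring info by host) can be sketched as follows. The log-line layout matched by the regular expression is a hypothetical one invented for this example, since real platforms each log in their own format. When several entries match, the most recent one is taken, which is an assumed policy.

```python
import re
from typing import Dict, Iterable, Optional

# Hypothetical execution-log layout; real platforms log in their own formats.
_LOG_RE = re.compile(
    r"platform=(?P<platform>\S+)\s+component=(?P<component>\S+)\s+host=(?P<host>\S+)"
)


def find_host(log_lines: Iterable[str], platform: str, component: str) -> Optional[str]:
    """Return the host of the most recent log line for this platform/component pair."""
    host = None
    for line in log_lines:
        m = _LOG_RE.search(line)
        if m and m["platform"] == platform and m["component"] == component:
            host = m["host"]  # keep scanning so the latest matching entry wins
    return host


def device_monitoring_for(
    log_lines: Iterable[str], platform: str, component: str, device_info: Dict[str, dict]
) -> Optional[dict]:
    """Resolve the device host name from the log, then look up its device monitoring info."""
    host = find_host(log_lines, platform, component)
    return device_info.get(host) if host else None
```

When several hosts run the same component, additionally filtering on the task sequence number would narrow the match. The text does not state that step; it is only a possible refinement.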

如此,应用监控信息、任务监控信息、平台监控信息、设备监控信息通过任务标识,完成了监控信息的关联,达到几种监控信息的无缝融合;在实际应用中,通过这种融合关系,在日常的维护保障的时候,可以建立一个用户界面,收集各监控信息的关联信息,使关联的应用监控信息、任务监控信息、平台监控信息、设备监控信息等信息同时提供给维护人员,直接获取应用包含任务在各任务节点的各种监控信息,方便运营保障和故障定位;在发生故障时可以通过故障点任务标识,确定故障涉及的平台类型,平台组件类型或故障设备;大大提高故障定位的效率。In this way, application monitoring information, task monitoring information, platform monitoring information, and equipment monitoring information complete the association of monitoring information through task identification, and achieve seamless integration of several types of monitoring information; in practical applications, through this fusion relationship, in During daily maintenance and guarantee, a user interface can be established to collect the related information of each monitoring information, so that the related application monitoring information, task monitoring information, platform monitoring information, equipment monitoring information and other information can be provided to the maintenance personnel at the same time, and the application monitoring information can be obtained directly. Contains various monitoring information of tasks at each task node, which is convenient for operation guarantee and fault location; when a fault occurs, the fault point task identification can be used to determine the platform type, platform component type or faulty equipment involved in the fault; greatly improve the efficiency of fault location .

The present invention is described in further detail below with reference to Embodiment 1.

Figure 3(a) and Figure 3(b) show the sequence of the fault-location process, using the task identifier, for a synchronous task component and an asynchronous task component respectively. Figure 4 shows how the task identifier links tasks, platforms, devices and applications. An application contains multiple tasks, so applications and tasks have a one-to-many relationship, and the system maintains this relationship. A single task is also a job flow, and a job flow consists of multiple task nodes. Each time the flow runs, every task node is assigned a unique task identifier, and this identifier links the node to the job that runs on the platform. Each platform job executes on a specific device, so the platform job log or job status can in turn be linked to the corresponding device information. The task identifier therefore ties together tasks, platforms, devices and applications. When a fault occurs, the task log can then be used for fault location, fault analysis, fault monitoring and similar operations.
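The relationships in Figure 4 can be modelled in memory as below. This is only a sketch. The class names, the counter-based run number and the colon-separated identifier layout are assumptions made for illustration. They are not the identifier format of the patent, which Figure 2 shows.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    name: str
    platform_type: str
    component_type: str


@dataclass
class Task:
    """A task is a job flow: an ordered list of task nodes."""
    name: str
    nodes: list = field(default_factory=list)


@dataclass
class Application:
    """An application owns many tasks (one-to-many)."""
    name: str
    tasks: list = field(default_factory=list)


class RunRegistry:
    """Hypothetical registry: every run of a flow gives each task node a fresh,
    unique identifier and records which application/task/node it belongs to."""

    def __init__(self):
        self._runs = itertools.count(1)
        self.index = {}  # identifier -> (application, task, node)

    def start_run(self, app, task):
        run_no = next(self._runs)
        ids = []
        for node in task.nodes:
            # Assumed layout for illustration: platform:component:task:run:node
            tid = (f"{node.platform_type}:{node.component_type}:"
                   f"{task.name}:{run_no:06d}:{node.name}")
            self.index[tid] = (app.name, task.name, node.name)
            ids.append(tid)
        return ids
```

Each node identifier already carries the platform type and component type, so a fault reported against one identifier can be traced back to its application, task and node through `index`. The platform monitoring can be located from the same identifier without any further lookup.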

The present invention is described in still further detail below with reference to Embodiment 2.

In this embodiment, task identifiers are set on the task nodes of each task in a specific business application, which enables effective monitoring of how the application's tasks run.

The business application computes and analyses users' social-circle behaviour. It uses users' voice call detail records to discover and analyse each user's social circle. It evaluates call behaviour between the user and other users through indicators such as call frequency, call duration and call time period, and from these it determines whether the user is the most influential member of the social circle. The business process, shown in Figure 5, comprises:

Step 501: collect the call detail record files from the interface machine into the Hadoop Distributed File System (HDFS);

Step 502: clean, filter and sort the call detail records with a Map/Reduce program;

Step 503: load the output of step 502 into Hive, aggregate it by calling number and called number, and compute indicators such as number of calls, call duration and call time period;

Step 504: export the analysis results to a relational database with a Sqoop script.
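In the source, step 503 runs in Hive. The Python sketch below only illustrates the same per-pair aggregation on records that have already been cleaned. The patent does not define the record layout or the three time-period buckets used here, so both are assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime


def time_period(start: datetime) -> str:
    """Bucket a call start time; the boundaries are assumptions, not from the source."""
    if start.hour < 8:
        return "night"
    if start.hour < 18:
        return "daytime"
    return "evening"


def summarize_calls(records):
    """Aggregate cleaned call records by (calling number, called number).

    Each record is (calling_no, called_no, start_time, duration_seconds).
    Returns {(calling_no, called_no): {"calls", "total_seconds", "periods"}}.
    """
    summary = defaultdict(lambda: {"calls": 0, "total_seconds": 0, "periods": Counter()})
    for calling_no, called_no, start, seconds in records:
        row = summary[(calling_no, called_no)]
        row["calls"] += 1                      # call frequency
        row["total_seconds"] += seconds        # call duration
        row["periods"][time_period(start)] += 1  # call time period
    return dict(summary)
```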

This business process applies the technical solution of the present invention, with a task identifier set at each node. Figure 6 shows the business process run records, Figure 7(a) the status of the business process nodes, and Figure 7(b) the business process run log. Figure 7(c) shows how the task identifiers of the business process nodes map, by name, to jobs on the platform. Figure 7(d) shows the running status of the platform jobs that correspond to those task identifiers. Figure 7(e) shows how those jobs ran on the platform, together with the corresponding devices.

Figure 7 shows that, throughout the business process, the task monitoring information, platform monitoring information and device monitoring information of each task node have been linked through the task identifier. Maintenance staff can therefore obtain the information they need easily. When a fault occurs, they can readily determine the platform, platform component or device where it occurred.

An embodiment of the present invention provides a task operation monitoring information acquisition apparatus. As shown in Figure 8, the apparatus comprises a setting module 81 and a determination module 82, wherein:

The setting module 81 is configured to set a task identifier for each task node at which the task runs, according to the running information of that node;

Typically, a big data application contains multiple tasks, so big data applications and tasks have a one-to-many relationship, and the system maintains this relationship. A single task is also called a job flow, and a job flow consists of multiple task nodes. The prior art assigns each task a task identifier or task sequence number that has no relation to its running information, and uses it to track the task's execution status. This approach has two drawbacks: the execution status of individual task nodes cannot be obtained precisely, and no running information can be read directly from the identifier.

In the technical solution of the present invention, a different task identifier is set at each task node while the task runs. Each identifier contains the task's running information at that node, namely the running platform type, the running platform component type and/or the task sequence number. The running platform type is the type of platform on which the task node runs, such as Java, Storm, Hadoop or Spark. The running platform component type is the type of platform component on which the task node runs, such as the Java-Node component of the Java platform. The task sequence number is a unique sequence number assigned to the task before it runs.

In practice, the task identifier may take the form shown in Figure 2. It may also include the task type and task name, so that tasks can be identified more quickly and conveniently. Following the form of Figure 2, task identifiers for the different task component types can be set as follows:

A synchronous task component runs directly on the Oozie server and reports only success or failure, with no other distinguishing information. Its task identifier can therefore be set to: oozie:none;

A single map/reduce task component triggers execution by submitting a map-only MapReduce job, whose map wraps the user-defined action component. The task identifier of this job can be defined as oozie:launcher:T={0}:W={1}:A={2}:ID={3}, where {0} is the component type, {1} the platform type, {2} the task name and {3} the task sequence number. The task identifier thus reflects the relationship between the task and the platform;

A dual map/reduce task component runs in the same way as a single map/reduce task component, except that the user-defined action wrapped in its map is itself a MapReduce job. The task identifier of such a MapReduce job can be defined as oozie:action:T={0}:W={1}:A={2}:ID={3}, where {0} is the component type, {1} the platform type, {2} the task name and {3} the task sequence number. Here too, the task identifier reflects the relationship between the task and the platform;

The single map/reduce and dual map/reduce task components are referred to here as asynchronous task components.
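The `{0}`…`{3}` placeholders in these templates happen to match Python's positional `str.format` fields, so one parser can build and read back all three identifier forms. In the sketch below, the kind labels `"sync"`/`"single"`/`"dual"` and the function names are hypothetical. The parser assumes that no field value contains a colon.

```python
import re

# Templates as given in the text; field meaning follows the patent:
# {0}=component type, {1}=platform type, {2}=task name, {3}=task sequence number.
SYNC_ID = "oozie:none"
SINGLE_MR_TEMPLATE = "oozie:launcher:T={0}:W={1}:A={2}:ID={3}"
DUAL_MR_TEMPLATE = "oozie:action:T={0}:W={1}:A={2}:ID={3}"

# Assumes T, W and A values contain no ':'; ID takes the rest of the string.
_PATTERN = re.compile(r"oozie:(launcher|action):T=([^:]*):W=([^:]*):A=([^:]*):ID=(.*)")


def make_task_id(kind, component_type="", platform_type="", task_name="", seq_no=""):
    """Build an identifier; kind is 'sync', 'single' or 'dual' (hypothetical labels)."""
    if kind == "sync":
        return SYNC_ID  # synchronous components carry no extra information
    template = {"single": SINGLE_MR_TEMPLATE, "dual": DUAL_MR_TEMPLATE}[kind]
    return template.format(component_type, platform_type, task_name, seq_no)


def parse_task_id(task_id):
    """Split an asynchronous identifier into fields; None for oozie:none or other text."""
    m = _PATTERN.fullmatch(task_id)
    if m is None:
        return None
    prefix, comp, plat, name, seq = m.groups()
    return {
        "kind": "single" if prefix == "launcher" else "dual",
        "component_type": comp,
        "platform_type": plat,
        "task_name": name,
        "seq_no": seq,
    }
```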

The determination module 82 is configured to obtain, according to each task identifier, the monitoring information on task execution at the task node that corresponds to the content of that identifier;

Existing big data monitoring information comprises four kinds: application monitoring information, task monitoring information, platform monitoring information and device monitoring information. Each kind records various running information about tasks during execution. Task, platform and device monitoring information are independent of one another and are not linked. The system maintains the relationship between big data applications and tasks, so task monitoring information can be linked to application monitoring information through task ownership. The task identifier of the present invention can link task, platform and device monitoring information, so that all four kinds of monitoring information become associated. The three kinds contain the following:

- Platform monitoring information: the platform name, platform type, platform status, and the execution status and logs of tasks on the platform.
- Task monitoring information: task flow information, flow stages, the current stage, the time spent in each stage, node transitions and task output logs.
- Device monitoring information: the device platform, device host information and device host running status.

Once the task identifier links these three kinds, the linked information can be merged into the application monitoring information. The application monitoring information can then report the tasks the application contains, the execution status of each task, how each task ran on the platform, and the running status of the device host on which each task ran. In this way, information on how every task of the whole application ran at each task node can be obtained;
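The following rough sketch builds the merged application view. It assumes each monitoring source is available as a simple lookup table: application to task identifiers (maintained by the system), sequence number to task record, (platform type, component type) to platform record, sequence number to host name (already resolved from the task execution log), and host name to device record. All names and record shapes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskId:
    """Hypothetical structured form of a task identifier."""
    platform_type: str
    component_type: str
    seq_no: str


def application_view(app, tasks_of_app, task_mon, platform_mon, host_of_task, device_mon):
    """Merge task, platform and device monitoring into one report for an application.

    tasks_of_app:  app name -> list[TaskId]   (ownership kept by the system)
    task_mon:      seq_no -> task record
    platform_mon:  (platform_type, component_type) -> platform record
    host_of_task:  seq_no -> host name (resolved from the task execution log)
    device_mon:    host name -> device record
    Missing entries appear as None rather than raising.
    """
    report = []
    for tid in tasks_of_app.get(app, []):
        host = host_of_task.get(tid.seq_no)
        report.append({
            "task": tid.seq_no,
            "execution": task_mon.get(tid.seq_no),
            "platform": platform_mon.get((tid.platform_type, tid.component_type)),
            "host": host,
            "device": device_mon.get(host),
        })
    return report
```

Missing data shows up as `None` fields rather than errors. A maintenance view can therefore still show whatever part of the chain (task, platform, device) is available for a faulty node.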

Specifically, a unique task sequence number can be assigned to a task before it runs, and the task monitoring information can be keyed to that number, for example by naming the task monitoring information after it. The sequence number in the task identifier then retrieves the corresponding task monitoring information. Alternatively, a node-specific suffix can be appended to the sequence number, and separate task monitoring information can be kept for each task node's sequence number. The sequence number in the task identifier then retrieves the task monitoring information of the corresponding task node. If a fault occurs while a big data task is running, the faulty task node's identifier yields that node's task monitoring information.
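A minimal sketch of both keying options follows: per task by sequence number, or per node with a suffix appended. The `-` separator, the store class and its method names are assumptions; the source says only that a node-specific suffix is added.

```python
def node_key(seq_no, node=None):
    """Key under which task monitoring information is stored.

    Without a node, the key is the task sequence number itself; with a node,
    a node-specific suffix is appended (the '-' separator is an assumption).
    """
    return seq_no if node is None else f"{seq_no}-{node}"


class TaskMonitoringStore:
    """Hypothetical store of task monitoring records named after sequence numbers."""

    def __init__(self):
        self._records = {}

    def record(self, seq_no, info, node=None):
        self._records.setdefault(node_key(seq_no, node), []).append(info)

    def lookup(self, seq_no, node=None):
        """Return a copy of the records for a task (or one of its nodes); [] if none."""
        return list(self._records.get(node_key(seq_no, node), []))
```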

The running platform type and running platform component type in the task identifier identify the platform type and platform component type on which the current task node runs. Platform monitoring information is usually classified by platform type and platform component type, so these two fields link the task node to its platform monitoring information. If a fault occurs while a big data task is running, the platform type and platform component type in the faulty task node's identifier retrieve that node's platform monitoring information.
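Because platform monitoring information is grouped by (platform type, component type), the lookup comes down to a keyed index, as in the short sketch below. The record dictionaries and the `platform_type`/`component_type` field names are assumptions made for illustration.

```python
from collections import defaultdict


def index_platform_monitoring(records):
    """Group platform monitoring records by (platform type, component type)."""
    index = defaultdict(list)
    for rec in records:
        index[(rec["platform_type"], rec["component_type"])].append(rec)
    return index


def platform_info_for(index, platform_type, component_type):
    """Use the two fields taken from a task identifier to select platform records.

    dict.get does not trigger defaultdict's factory, so unknown keys are not added.
    """
    return index.get((platform_type, component_type), [])
```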

The running platform type and running platform component type in the task identifier can be used to retrieve, from the task execution log in the platform monitoring information, the host name of the device that runs that platform type and component type. The device host name is then used to retrieve that task node's record from the device monitoring information. If a fault occurs while a big data task is running, the platform type and platform component type in the faulty task node's identifier locate the device host name in the task execution log. That host name in turn yields the device monitoring information for the task node.

In this way, application monitoring information, task monitoring information, platform monitoring information and device monitoring information are linked through the task identifier, which integrates the different kinds of monitoring information seamlessly. In day-to-day maintenance, this linkage can support a user interface that collects the associated records from each source. The interface presents the related application, task, platform and device monitoring information to maintenance staff together. Staff can then see directly every kind of monitoring information for each task of an application at each task node, which makes operations support and fault location easier. When a fault occurs, the task identifier at the fault point identifies the platform type, platform component type or faulty device involved, which greatly improves the efficiency of fault location.

In practice, the setting module 81 and the determination module 82 may be implemented by a central processing unit (CPU), microprocessor (MPU), digital signal processor (DSP), field-programmable gate array (FPGA) or similar component of the big data server system.

The above is only a preferred embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A task operation monitoring information acquisition method, characterized in that the method comprises: setting a task identifier for each task node at which a task runs, according to the running information of that task node; and the method further comprises:
obtaining from the monitoring information, according to a fault-point task identifier, the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the task node corresponding to the content of the fault-point task identifier, and determining fault information;
wherein the content of the task identifier comprises: the running platform type, and/or the running platform component type, and/or a task sequence number.

2. The method according to claim 1, characterized in that:
the running information comprises: a running platform type, and/or a running platform component type;
the task sequence number comprises: a unique identification number preset for the task, or a unique identification number preset for the task node.

3. The method according to claim 2, characterized in that obtaining from the monitoring information, according to the fault-point task identifier, the task monitoring information of the task node corresponding to the content of the fault-point task identifier, and determining fault information, comprises:
associating the task sequence number with the task monitoring information in advance;
determining, according to the task sequence number in the fault-point task identifier, the task monitoring information of the task node corresponding to the fault-point task identifier.

4. The method according to claim 2, characterized in that obtaining from the monitoring information, according to the fault-point task identifier, the platform monitoring information of the task node corresponding to the content of the fault-point task identifier, and determining fault information, comprises:
determining, according to the running platform type and running platform component type in the fault-point task identifier, the platform type and platform component type of the task node corresponding to the fault-point task identifier;
determining, according to the platform type and platform component type of that task node, the platform monitoring information corresponding to that platform type and platform component type.

5. The method according to claim 2, characterized in that obtaining from the monitoring information, according to the fault-point task identifier, the device monitoring information of the task node corresponding to the content of the fault-point task identifier, and determining fault information, comprises:
retrieving, from the task execution log, the host name of the device on which the running platform type and running platform component type in the fault-point task identifier run;
determining, according to the device host name, the device monitoring information of the task node corresponding to the fault-point task identifier.

6. A task operation monitoring information acquisition apparatus, characterized in that the apparatus comprises a setting means and a determining means, wherein:
the setting means is configured to set a task identifier for each task node at which a task runs, according to the running information of that task node;
the determining means is configured to obtain from the monitoring information, according to a fault-point task identifier, the task monitoring information, and/or platform monitoring information, and/or device monitoring information of the task node corresponding to the content of the fault-point task identifier, and to determine fault information;
the content of the task identifier comprises: the running platform type, and/or the running platform component type, and/or a task sequence number.

7. The apparatus according to claim 6, characterized in that:
the running information comprises: a running platform type, and/or a running platform component type;
the task sequence number comprises: a unique identification number preset for the task, or a unique identification number preset for the task node.

8. The apparatus according to claim 7, characterized in that the determining means is specifically configured to:
associate the task sequence number with the task monitoring information in advance;
wherein determining the fault information comprises: determining, according to the task sequence number in the fault-point task identifier, the task monitoring information of the task node corresponding to the fault-point task identifier.

9. The apparatus according to claim 7, characterized in that the determining means is specifically configured to:
determine, according to the running platform type and running platform component type in the fault-point task identifier, the platform type and platform component type of the task node corresponding to the fault-point task identifier;
determine, according to the platform type and platform component type of that task node, the platform monitoring information corresponding to that platform type and platform component type.

10. The apparatus according to claim 7, characterized in that the determining means is specifically configured to:
retrieve, from the task execution log, the host name of the device on which the running platform type and running platform component type in the fault-point task identifier run;
determine, according to the device host name, the device monitoring information of the task node corresponding to the fault-point task identifier.
CN201610158804.3A 2016-03-18 2016-03-18 Task operation monitoring information acquisition method and device Active CN107204868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610158804.3A CN107204868B (en) 2016-03-18 2016-03-18 Task operation monitoring information acquisition method and device

Publications (2)

Publication Number Publication Date
CN107204868A true CN107204868A (en) 2017-09-26
CN107204868B CN107204868B (en) 2020-08-18

Family

ID=59904279

Country Status (1)

Country Link
CN (1) CN107204868B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6952780B2 (en) * 2000-01-28 2005-10-04 Safecom A/S System and method for ensuring secure transfer of a document from a client of a network to a printer
GB2465860A (en) * 2008-12-04 2010-06-09 Ibm A directed graph behaviour model for monitoring a computer system in which each node of the graph represents an event generated by an application
CN102521099A (en) * 2011-11-24 2012-06-27 深圳市同洲视讯传媒有限公司 Process monitoring method and process monitoring system
CN103902646A (en) * 2013-12-27 2014-07-02 北京天融信软件有限公司 Distributed task managing system and method

Non-Patent Citations (2)

Title
LIUQIYUN: "Oozie and Yarn working together" (in Chinese), CSDN, https://blog.csdn.net/samhacker/article/details/21413057 *
MSDN: "How to schedule sqoop job command using oozie in azure", MSDN, https://social.msdn.microsoft.com/Forums/en-US/a520f0df-ef13-48c0-b737-218625047284/how-to-schedule-sqoop-job-command-using-oozie-in-azure-hdinsightremote-machine?forum=hdinsight *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN108776579A (en) * 2018-06-19 2018-11-09 郑州云海信息技术有限公司 A kind of distributed storage cluster expansion method, device, equipment and storage medium
CN108776579B (en) * 2018-06-19 2021-10-15 郑州云海信息技术有限公司 Distributed storage cluster expansion method, device, equipment and storage medium
CN109471709A (en) * 2018-10-16 2019-03-15 深圳中顺易金融服务有限公司 The dispatching method of flow tasks based on Apache Oozie frame processing big data
CN110209893A (en) * 2019-04-23 2019-09-06 北京奇艺世纪科技有限公司 Task creating method, system and storage medium
CN110489261A (en) * 2019-07-31 2019-11-22 上海艾融软件股份有限公司 Task handles alarm method, device and electronic equipment, storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant