CN114461506A - Cluster alarm control method, device, electronic device and storage medium - Google Patents

Info

Publication number
CN114461506A
Authority
CN
China
Prior art keywords
alarm
item
alarm item
cluster
items
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210055230.2A
Other languages
Chinese (zh)
Inventor
孙吴昊
郭广路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd
Priority to CN202210055230.2A
Publication of CN114461506A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Alarm Systems (AREA)

Abstract

An embodiment of the invention provides a cluster alarm control method comprising the following steps: during cluster operation, when a first alarm item on the management side is detected, determining whether a second alarm item on the user side that is associated with the first alarm item currently exists; and, if the second alarm item exists, determining whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item. Because the decision to issue alarm information combines the management-side alarm item with the user-side alarm item, the technical solution can improve the accuracy of alarm information, reduce its false alarm rate, and avoid unnecessary labor and material costs. Embodiments of the invention also provide a cluster alarm control device, an electronic device and a computer-readable storage medium, which have corresponding technical effects.

Description

Cluster alarm control method, device, electronic device and storage medium

Technical Field

The present invention relates to the technical field of computer applications, and in particular to a cluster alarm control method, device, electronic device and storage medium.

Background

With the rapid development of computer technology, cluster technology has gradually matured. Clustering delivers relatively high gains in performance, reliability and flexibility at relatively low cost. Enterprises and institutions are deploying more and more clusters to increase their business processing capacity.

To keep a cluster stable, an alarm mechanism is usually set up on the management side to diagnose the health of the cluster. As soon as the cluster has a problem, alarm information is issued, and an administrator inspects the cluster based on that information and resolves the problem.

This approach resolves cluster problems in a timely manner to some extent. However, a cluster has some self-healing capability and can sometimes recover automatically shortly after a problem occurs. Issuing alarm information as soon as any problem appears therefore produces false alarms. If the false alarm rate is high, the administrator's inspections of the cluster based on the alarm information consume considerable unnecessary labor and material costs.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a cluster alarm control method, device, electronic device and storage medium that reduce the false alarm rate of alarm information and avoid unnecessary labor and material costs. The specific technical solutions are as follows:

In a first aspect of the present invention, a cluster alarm control method is provided, including:

during cluster operation, when a first alarm item on the management side is detected, determining whether a second alarm item on the user side that is associated with the first alarm item currently exists;

if the second alarm item currently exists, determining, according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item, whether to issue alarm information for the first alarm item.

In a specific embodiment of the present invention, determining whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item includes:

determining, according to the alarm data of the first alarm item and the alarm data of the second alarm item, the magnitude relationship between the alarm level of the first alarm item and the alarm level of the second alarm item;

if the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, determining to issue alarm information for the first alarm item.

In a specific embodiment of the present invention, in the case where the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, before determining to issue alarm information for the first alarm item, the method further includes:

determining, according to the alarm data of the first alarm item and the alarm data of the second alarm item, the inclusion relationship between the management-side associated alarm item set of the second alarm item and the first alarm item, where the management-side associated alarm item set of the second alarm item comprises the preset management-side alarm items that are associated with the second alarm item;

if the management-side associated alarm item set of the second alarm item contains the first alarm item, performing the step of determining to issue alarm information for the first alarm item.
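The two checks described above (level comparison, then set inclusion) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `UserAlarm` class, its field names, the event strings, and the convention that a larger integer means a more severe level are all hypothetical. The text at this point does not say what happens when either check fails, so the function only reports whether this rule decides to send.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAlarm:
    """Hypothetical user-side (second) alarm item."""
    event: str
    level: int  # assumption: a larger value means a more severe alarm
    mgmt_associated: frozenset = frozenset()  # preset management-side associated set


def rule_says_send(first_event: str, first_level: int, second: UserAlarm) -> bool:
    """Return True only if both checks described in the text pass."""
    # Check 1: the second item's level must be >= the first item's level.
    if second.level < first_level:
        return False
    # Check 2: the second item's management-side associated set
    # must contain the first alarm item.
    return first_event in second.mgmt_associated


second = UserAlarm(
    event="pod-create-failed",
    level=3,
    mgmt_associated=frozenset({"api-server-abnormal"}),
)
print(rule_says_send("api-server-abnormal", 2, second))  # True
print(rule_says_send("disk-space-low", 2, second))       # False
```

Keeping the associated set on the user-side item mirrors the wording of the embodiment, which checks the second alarm item's set for the first alarm item rather than the reverse.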

In a specific embodiment of the present invention, in the case where the second alarm item does not currently exist, the method further includes:

determining whether, within the judgment waiting duration preceding the moment at which the first alarm item was detected, there is an unprocessed alarm item whose alarm event is the same as that of the first alarm item;

if there is no unprocessed alarm item with the same alarm event as the first alarm item, ignoring the first alarm item.
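The look-back check above can be sketched as a filter over a history of alarm records. The `AlarmRecord` type, the use of plain seconds for timestamps, and the half-open window `[detected_at - wtime, detected_at)` are assumptions for illustration; the source only says the window is the judgment waiting duration before the detection moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmRecord:
    """Hypothetical stored record of an earlier management-side alarm item."""
    event: str
    detected_at: float  # seconds on an arbitrary clock
    processed: bool


def pending_same_event(history, event, detected_at, wtime):
    """Unprocessed records with the same event inside the look-back window."""
    window_start = detected_at - wtime
    return [
        r for r in history
        if r.event == event
        and not r.processed
        and window_start <= r.detected_at < detected_at
    ]


history = [
    AlarmRecord("api-server-abnormal", 100.0, processed=False),
    AlarmRecord("api-server-abnormal", 10.0, processed=False),  # outside the window
    AlarmRecord("api-server-abnormal", 110.0, processed=True),  # already handled
    AlarmRecord("disk-space-low", 105.0, processed=False),      # different event
]
matches = pending_same_event(history, "api-server-abnormal", detected_at=120.0, wtime=60.0)
ignore_first_alarm = not matches  # no match means the first alarm item is ignored
print(len(matches), ignore_first_alarm)  # 1 False
```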

In a specific embodiment of the present invention, in the case where there is an unprocessed alarm item with the same alarm event as the first alarm item, the method further includes:

determining whether to issue alarm information for the first alarm item according to the number of unprocessed alarm items with the same alarm event as the first alarm item.

In a specific embodiment of the present invention, determining whether to issue alarm information for the first alarm item according to the number of unprocessed alarm items with the same alarm event as the first alarm item includes:

if the number of unprocessed alarm items with the same alarm event as the first alarm item is within a set first number range, ignoring the first alarm item;

if the number of unprocessed alarm items with the same alarm event as the first alarm item is within a set second number range, determining to issue alarm information for the first alarm item;

where the upper limit of the first number range is smaller than the lower limit of the second number range.
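The count-based rule can be sketched as below. The concrete range bounds, the inclusive endpoints and the `UNSPECIFIED` outcome are assumptions: the source fixes only the ordering of the two ranges and does not say what happens to counts that fall in neither.

```python
from enum import Enum


class Decision(Enum):
    IGNORE = "ignore"
    SEND = "send"
    UNSPECIFIED = "unspecified"  # count in neither range; not covered by the text


def decide_by_count(count, first_range=(1, 2), second_range=(3, 10)):
    """Map the number of pending same-event alarm items to a decision.

    Ranges are (low, high) with inclusive endpoints; the default values
    are illustrative only.
    """
    # Constraint from the text: upper limit of the first range must be
    # smaller than the lower limit of the second range.
    if first_range[1] >= second_range[0]:
        raise ValueError("first range must end below the start of the second range")
    if first_range[0] <= count <= first_range[1]:
        return Decision.IGNORE
    if second_range[0] <= count <= second_range[1]:
        return Decision.SEND
    return Decision.UNSPECIFIED


print(decide_by_count(1))  # Decision.IGNORE
print(decide_by_count(5))  # Decision.SEND
```

Read this way, a few repeats of the same event are treated as noise that the cluster may still heal, while sustained repeats are escalated.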

In a specific embodiment of the present invention, when the first alarm item on the management side is detected, before determining whether a second alarm item on the user side that is associated with the first alarm item currently exists, the method further includes:

determining the alarm type of the first alarm item;

if the alarm type of the first alarm item is the preset low-sensitivity type, determining whether the alarm event corresponding to the first alarm item has been successfully recovered when the allowed self-healing waiting duration of the first alarm item is reached;

if the alarm event corresponding to the first alarm item has not been successfully recovered, performing the step of determining whether a second alarm item on the user side that is associated with the first alarm item currently exists.

In a specific embodiment of the present invention, in the case where the alarm type of the first alarm item is the preset low-sensitivity type, the method further includes:

marking the alarm state of the first alarm item as a pre-recovery state;

if the alarm event corresponding to the first alarm item is successfully recovered within the allowed self-healing waiting duration of the first alarm item, updating the alarm state of the first alarm item from the pre-recovery state to a recovered state;

if the alarm event corresponding to the first alarm item is not successfully recovered within the allowed self-healing waiting duration of the first alarm item, updating the alarm state of the first alarm item from the pre-recovery state to a fault state and, after alarm information for the first alarm item has been issued, upon detecting that the alarm event corresponding to the first alarm item has been successfully handled, updating the alarm state of the first alarm item from the fault state to a processing-complete state.
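The state changes described for a low-sensitivity alarm item can be sketched as a small state machine. The enum names, the `ALLOWED` table and the helper functions are hypothetical; only the four states and the three transitions between them come from the text.

```python
from enum import Enum


class AlarmState(Enum):
    PRE_RECOVERY = "pre-recovery"
    RECOVERED = "recovered"
    FAULT = "fault"
    PROCESSED = "processing-complete"


# The three transitions named in the text; anything else is rejected.
ALLOWED = {
    (AlarmState.PRE_RECOVERY, AlarmState.RECOVERED),  # healed within the wait
    (AlarmState.PRE_RECOVERY, AlarmState.FAULT),      # not healed within the wait
    (AlarmState.FAULT, AlarmState.PROCESSED),         # alarm sent, event then handled
}


def transition(current: AlarmState, new: AlarmState) -> AlarmState:
    """Move to `new` if the text describes that transition, else raise."""
    if (current, new) not in ALLOWED:
        raise ValueError(f"transition {current.value} -> {new.value} is not described")
    return new


def after_self_heal_wait(recovered_in_time: bool) -> AlarmState:
    """State of a low-sensitivity alarm item once the self-healing wait ends."""
    target = AlarmState.RECOVERED if recovered_in_time else AlarmState.FAULT
    return transition(AlarmState.PRE_RECOVERY, target)


state = after_self_heal_wait(recovered_in_time=False)
print(state.value)  # fault
state = transition(state, AlarmState.PROCESSED)
print(state.value)  # processing-complete
```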

In a second aspect of the present invention, a cluster alarm control device is also provided, including:

an associated alarm item existence determination module, configured to determine, during cluster operation and when a first alarm item on the management side is detected, whether a second alarm item on the user side that is associated with the first alarm item currently exists;

an alarm information issuance determination module, configured to determine, when the second alarm item currently exists, whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item.

In yet another aspect of the present invention, an electronic device is also provided, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the steps of the above cluster alarm control method when executing the program stored in the memory.

In yet another aspect of the present invention, a computer-readable storage medium is also provided, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the above cluster alarm control method.

In yet another aspect of the present invention, a computer program product containing instructions is also provided which, when run on a computer, causes the computer to execute the above cluster alarm control method.

With the technical solution provided by the embodiments of the present invention, when a first alarm item on the management side is detected, alarm information for the first alarm item is not issued directly. Instead, it is first determined whether a second alarm item on the user side that is associated with the first alarm item currently exists. If it does, the alarm event corresponding to the first alarm item is considered likely to have affected user-side use, and whether to issue alarm information for the first alarm item is then determined according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item. In other words, the decision to issue alarm information combines the management-side alarm item with the user-side alarm item, which can improve the accuracy of alarm information, reduce its false alarm rate, and avoid unnecessary labor and material costs.

Description of Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.

FIG. 1 is a schematic diagram of the architecture of a monitoring and alarm platform in an embodiment of the present invention;

FIG. 2 is an implementation flowchart of a cluster alarm control method in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a specific cluster alarm control process in an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a cluster alarm control device in an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings of the embodiments.

The core of the present invention is to provide a cluster alarm control method that can be applied to a monitoring and alarm platform. For ease of understanding, the architecture of the monitoring and alarm platform to which the technical solution applies is introduced first.

FIG. 1 shows the architecture of the monitoring and alarm platform. In the platform, the cluster control center can monitor the health of one or more clusters, such as cluster A, cluster B, cluster C, …, cluster n-1, cluster n and cluster n+1, and can generate a management-side alarm item when a cluster has a problem. Likewise, when a cluster problem affects users, the user application management center can generate a user-side alarm item. The alarm control center can obtain both the management-side alarm items and the user-side alarm items and combine them to determine whether to issue alarm information. This can improve the accuracy of alarm information, reduce its false alarm rate, and avoid unnecessary labor and material costs.

The above describes the technical solution as a whole from the perspective of the monitoring and alarm platform to which it is applied. The specific implementation of the embodiments is described in detail below.

FIG. 2 is an implementation flowchart of a cluster alarm control method provided by an embodiment of the present invention. The method can be applied to the alarm control center in the monitoring and alarm platform and may include the following steps:

S210: During cluster operation, when a first alarm item on the management side is detected, determine whether a second alarm item on the user side that is associated with the first alarm item currently exists.

Some problems inevitably arise during cluster operation, for example an abnormal application programming interface server (api-server) process, an abnormal controller manager (controller manager) process, or remaining free disk space below a set threshold. When the cluster has a problem, a management-side alarm item is generated. A management-side alarm item can be regarded as an alarm item from the administrator's perspective, generated by the alarm mechanism configured from that perspective.

When a cluster problem affects user-side use, a user-side alarm item is generated. A user-side alarm item can be regarded as an alarm item from the user's perspective, generated by the alarm mechanism configured from that perspective.

Specifically, a management-side alarm item may be generated by the cluster control center of the monitoring and alarm platform when a cluster has a problem, and a user-side alarm item may be generated by the user application management center of the platform when it detects that a cluster problem affects users. That is, management-side alarm items can be obtained through the cluster control center, and user-side alarm items can be obtained through the user application management center.

However, not every cluster problem affects user-side use. For example, when the remaining free disk space of the cluster falls below the set threshold, a management-side alarm item is generated for that problem, but the problem does not affect user-side use, so no user-side alarm item is generated.

In this embodiment of the present invention, the association between management-side alarm items and user-side alarm items may be established in advance. For example, a management-side alarm item may be denoted as [formula image BDA0003475927330000061], and its corresponding set of associated user-side alarm items, that is, its user-side associated alarm item set, may be denoted as [formula images BDA0003475927330000062 and BDA0003475927330000063]. A user-side alarm item may be denoted as [formula image BDA0003475927330000064], and its corresponding set of associated management-side alarm items, that is, its management-side associated alarm item set, may be denoted as [formula images BDA0003475927330000065 and BDA0003475927330000066]. In addition, the strength of the association between each alarm item and each alarm item in its associated alarm item set may be set, for example strong or weak. For a single alarm item, multiple alarm item sets may also be used to represent alarm items that have different kinds of association with it.

During cluster operation, if a first alarm item on the management side is detected, it may first be determined whether a second alarm item on the user side that is associated with the first alarm item currently exists. Specifically, this can be determined from the pre-established association between management-side alarm items and user-side alarm items. The association may be recorded in a database; when the alarm control center detects the first alarm item on the management side, it can query the association recorded in the database to determine whether an associated second alarm item on the user side currently exists.
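The lookup described above can be sketched with an in-memory mapping standing in for the database. The `ASSOCIATIONS` table, the event names and the function name are hypothetical; a real deployment would query whatever store holds the pre-established associations.

```python
# Hypothetical association table: management-side event -> associated user-side events.
# The empty set for the disk-space event follows the source's example that this
# problem does not affect user-side use.
ASSOCIATIONS = {
    "api-server-abnormal": frozenset({"pod-create-failed", "deploy-timeout"}),
    "disk-space-low": frozenset(),
}


def find_second_alarms(first_event, active_user_events, associations=ASSOCIATIONS):
    """Return the currently active user-side events associated with `first_event`."""
    related = associations.get(first_event, frozenset())
    # Only associated events that are active right now count as second alarm items.
    return sorted(related & set(active_user_events))


active = {"pod-create-failed", "login-slow"}
print(find_second_alarms("api-server-abnormal", active))  # ['pod-create-failed']
print(find_second_alarms("disk-space-low", active))       # []
```

An empty result corresponds to the branch in which no second alarm item currently exists.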

If a second alarm item on the user side that is associated with the first alarm item currently exists, the alarm event corresponding to the first alarm item can be considered to have affected user-side use, and the subsequent steps can be performed. If no such second alarm item currently exists, the alarm event corresponding to the first alarm item can be considered not yet to have affected user-side use; the first alarm item may then be left unhandled or handled according to other processing rules.

S220: If the second alarm item currently exists, determine whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item.

In this embodiment of the present invention, a first data format for the alarm data of management-side alarm items and a second data format for the alarm data of user-side alarm items may be set in advance. After a management-side alarm item is generated, its alarm data can be formatted according to the first data format and converted into alarm data in that format. After a user-side alarm item is generated, its alarm data can be formatted according to the second data format and converted into alarm data in that format. This makes it easier to determine relationships from the alarm data later.

For example, the alarm data of a management-side alarm item in the first data format is: [formula images BDA0003475927330000071 and BDA0003475927330000072]. The alarm data of a user-side alarm item in the second data format is: [formula image BDA0003475927330000073].

Here, Alert is the specific alarm content, that is, the alarm event; [formula image BDA0003475927330000074] is the user-side associated alarm item set; [formula image BDA0003475927330000075] is the management-side associated alarm item set; [formula image BDA0003475927330000076] is the alarm state of the management-side alarm item; P is the alarm level; Htime is the allowed self-healing waiting duration, and if the alarm type of an alarm item is the high-sensitivity type, Htime is 0, meaning that a high-sensitivity alarm item has no self-healing wait; Wtime is the judgment waiting duration, which may be expressed in seconds.

It should be noted that the first data format and the second data format above are only specific examples. Other formats may be used in a specific implementation, and the data items they contain may be increased or reduced.
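The data items named above can be modeled as two record types, as in the following sketch. The exact layouts appear only in formula images that are not reproduced here, so which fields belong to which format is an assumption, as are the Python field names, the raw dictionary keys and the `format_mgmt_alarm` helper. The one rule taken directly from the text is that a high-sensitivity alarm item has an Htime of 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MgmtAlarmData:
    """Assumed shape of management-side alarm data (first data format)."""
    alert: str                  # Alert: the alarm event
    user_associated: frozenset  # user-side associated alarm item set
    state: str                  # alarm state of the management-side alarm item
    p: int                      # P: alarm level
    htime: int                  # Htime: allowed self-healing wait
    wtime: int                  # Wtime: judgment waiting duration, in seconds


@dataclass(frozen=True)
class UserAlarmData:
    """Assumed shape of user-side alarm data (second data format)."""
    alert: str
    mgmt_associated: frozenset  # management-side associated alarm item set
    p: int


def format_mgmt_alarm(raw: dict) -> MgmtAlarmData:
    """Normalize a raw management-side alarm (hypothetical keys) into the first format."""
    high_sensitivity = raw.get("type") == "high-sensitivity"
    return MgmtAlarmData(
        alert=raw["alert"],
        user_associated=frozenset(raw.get("user_associated", ())),
        state=raw["state"],
        p=int(raw["p"]),
        # From the text: high-sensitivity alarm items have no self-healing wait.
        htime=0 if high_sensitivity else int(raw.get("htime", 0)),
        wtime=int(raw.get("wtime", 0)),
    )


item = format_mgmt_alarm({
    "alert": "api-server-abnormal",
    "type": "high-sensitivity",
    "user_associated": ["pod-create-failed"],
    "state": "fault",
    "p": 2,
    "htime": 30,
    "wtime": 60,
})
print(item.htime, item.wtime)  # 0 60
```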

在检测到管理侧的第一告警项的情况下,如果确定当前存在与第一告警项有关联关系的用户侧的第二告警项,则可以认为第一告警项对应的告警事件已经影响到用户侧的使用,可以进一步确定第一告警项的告警数据和第二告警项的告警数据之间的关系。然后根据第一告警项的告警数据和第二告警项的告警数据之间的关系,判断是否发出针对第一告警项的告警信息。When the first alarm item on the management side is detected, if it is determined that there is currently a second alarm item on the user side that is associated with the first alarm item, it can be considered that the alarm event corresponding to the first alarm item has affected the user By using the side, the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item can be further determined. Then, according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item, it is determined whether to send alarm information for the first alarm item.

应用本发明实施例所提供的方法,在检测到管理侧的第一告警项的情况下,并不是直接发出针对第一告警项的告警信息,而是先判断当前是否存在与第一告警项有关联关系的用户侧的第二告警项,如果存在,则认为管理侧的第一告警项对应的告警事件可能已经影响到用户侧的使用,然后根据第一告警项的告警数据和第二告警项的告警数据之间的关系,判断是否发出针对第一告警项的告警信息,也就是说,通过管理侧的告警项和用户侧的告警项的结合,判断是否发出告警信息,可以提高告警信息的准确性,降低告警信息的误报率,可以有效避免耗费较多不必要的人力物力成本。By applying the method provided by the embodiment of the present invention, when the first alarm item on the management side is detected, instead of directly sending out alarm information for the first alarm item, it is first judged whether there is currently a problem with the first alarm item. If the second alarm item on the user side of the associated relationship exists, it is considered that the alarm event corresponding to the first alarm item on the management side may have affected the use of the user side. Then, according to the alarm data of the first alarm item and the second alarm item The relationship between the alarm data and the alarm data, determine whether to send the alarm information for the first alarm item, that is to say, through the combination of the alarm item on the management side and the alarm item on the user side, judging whether to send the alarm information can improve the accuracy of the alarm information. Accuracy, reduce the false alarm rate of alarm information, and can effectively avoid unnecessary labor and material costs.

In an embodiment of the present invention, step S220 may include the following steps:

Step 1: determining, according to the alarm data of the first alarm item and the alarm data of the second alarm item, the relative magnitude of the alarm level of the first alarm item and the alarm level of the second alarm item;

Step 2: if the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, determining to send alarm information for the first alarm item.

For ease of description, the two steps above are explained together.

In this embodiment of the present invention, when a first alarm item on the management side is detected, and it is determined that a second alarm item on the user side that is associated with the first alarm item currently exists, the alarm event corresponding to the first alarm item can be considered to have already affected use on the user side, and the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item can be further determined.

Specifically, the alarm level of the first alarm item and the alarm level of the second alarm item can be compared on the basis of the two items' alarm data to obtain their relative magnitude. Once that relationship has been determined, it is used to decide whether to send alarm information for the first alarm item.

Alarm items on the management side and on the user side can each be assigned an alarm level, and multiple levels can be defined according to the degree of impact of the alarm event. For example, four alarm levels may be used: P1, P2, P3 and P4. A P1 alarm event is an important event that must be handled immediately once it occurs; a P2 alarm event is a fairly important event for which recovery within a few minutes is acceptable; a P3 alarm event is an unimportant event for which recovery within hours is acceptable; a P4 alarm event is a notification event that generally requires no attention. These four levels are only an example, and more or fewer alarm levels may be defined in a specific implementation.

If the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, the alarm event corresponding to the first alarm item can be considered to have a significant impact on use on the user side. In that case, it can be determined to send alarm information for the first alarm item. This improves the accuracy of the alarm information and reduces false alarms.

In a specific implementation of the present invention, when the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, and before it is determined to send alarm information for the first alarm item, the inclusion relationship between the second alarm item's management-side associated alarm item set and the first alarm item may also be determined from the alarm data of the two items. If that set contains the first alarm item, the step of determining to send alarm information for the first alarm item may be performed. The management-side associated alarm item set of the second alarm item consists of the preset management-side alarm items that are associated with the second alarm item.

It can be understood that if the alarm data of the first alarm item is abnormal, the association between the first alarm item and the second alarm item derived from that data will be wrong, and alarm information for the first alarm item may still be sent in error. Therefore, when the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, it can be further checked whether the second alarm item's management-side associated alarm item set contains the first alarm item. If it does, the two alarm items are considered to be genuinely associated, and it can be determined to send alarm information for the first alarm item. That is, alarm information for the first alarm item is sent only when the alarm level of the second alarm item is greater than or equal to that of the first alarm item and the second alarm item's management-side associated alarm item set contains the first alarm item, which further improves the accuracy of the alarm information and reduces false alarms.

If the alarm level of the second alarm item is lower than that of the first alarm item, or if it is greater than or equal to that of the first alarm item but the second alarm item's management-side associated alarm item set does not contain the first alarm item, the first alarm item may be ignored for the time being: it is recorded in a file such as a log, and no alarm information is sent for it. This reduces false alarms and avoids wasting manpower and material resources.
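The combined rule described above (level comparison plus set membership) can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the `AlarmItem` class, its field names, and the `SEVERITY_RANK` mapping are hypothetical, and the sketch assumes that a "greater" alarm level means a more severe one, so that P1 ranks highest.

```python
from dataclasses import dataclass

# Assumption: "greater" level means more severe, so P1 gets the highest rank.
SEVERITY_RANK = {"P1": 4, "P2": 3, "P3": 2, "P4": 1}


@dataclass(frozen=True)
class AlarmItem:
    event: str  # hypothetical identifier of the alarm event
    level: str  # one of the keys of SEVERITY_RANK
    # For a user-side item: preset management-side events associated with it.
    management_side_associated: frozenset = frozenset()


def should_send_alarm(first: AlarmItem, second: AlarmItem) -> bool:
    """Decide whether to send alarm information for the management-side item.

    `first` is the management-side item, `second` the associated user-side item.
    """
    level_ok = SEVERITY_RANK[second.level] >= SEVERITY_RANK[first.level]
    associated = first.event in second.management_side_associated
    return level_ok and associated


disk_low = AlarmItem(event="disk_low", level="P3")
pod_pending = AlarmItem(
    event="pod_pending",
    level="P2",
    management_side_associated=frozenset({"disk_low"}),
)
decision = should_send_alarm(disk_low, pod_pending)
```

When `should_send_alarm` returns `False`, the text above suggests logging the first alarm item instead of notifying anyone.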

In a specific implementation, when a first alarm item on the management side is detected and it is determined that a second alarm item on the user side that is associated with the first alarm item currently exists, the type of association between the second alarm item and the first alarm item, such as strong or weak association, may be determined first.

If the second alarm item is strongly associated with the first alarm item, it can be determined to send alarm information for the first alarm item when the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item and the associated alarm item set of the second alarm item contains the first alarm item.

If the second alarm item is weakly associated with the first alarm item, it can be determined to send alarm information for the first alarm item when the number of second alarm items is greater than a set number threshold.
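The strong/weak branching can be expressed as a small dispatch function. The sketch below is illustrative only: the function name, the string labels `"strong"` and `"weak"`, and the parameter names are assumptions, and the two boolean inputs stand for the level comparison and the set-membership check described earlier.

```python
def decide_by_association(
    association: str,
    level_ok: bool,
    in_associated_set: bool,
    second_item_count: int,
    count_threshold: int,
) -> bool:
    """Return True when alarm information for the first alarm item should be sent."""
    if association == "strong":
        # Strong association: both the level rule and the membership rule must hold.
        return level_ok and in_associated_set
    if association == "weak":
        # Weak association: enough user-side items must be present.
        return second_item_count > count_threshold
    raise ValueError(f"unknown association type: {association!r}")
```

For example, `decide_by_association("weak", False, False, second_item_count=4, count_threshold=3)` yields `True`, because only the count matters on the weak branch.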

In an embodiment of the present invention, when no second alarm item currently exists, the method may further include the following steps:

Step 1: determining whether, within the judgment waiting period preceding the moment the first alarm item was detected, there is an unprocessed alarm item with the same alarm event as the first alarm item;

Step 2: if there is no unprocessed alarm item with the same alarm event as the first alarm item, ignoring the first alarm item.

For ease of description, the two steps above are explained together.

As described above, the alarm data of a management-side alarm item may have a first data format that includes data fields such as the alarm event, the corresponding user-side associated alarm item set, the alarm level, the alarm state, the allowed self-healing waiting period, and the judgment waiting period. The judgment waiting period of different alarm items can be set on the basis of the alarm level and may be the same or different.

When a first alarm item on the management side is detected, and no second alarm item on the user side that is associated with the first alarm item currently exists, the alarm event corresponding to the first alarm item can be considered not to have affected use on the user side. It can then be further determined whether, within the judgment waiting period preceding the moment the first alarm item was detected, there is an unprocessed alarm item with the same alarm event as the first alarm item. Alarm items with the same alarm event have the same judgment waiting period.

If, within the judgment waiting period preceding the moment the first alarm item was detected, there is no unprocessed alarm item with the same alarm event as the first alarm item, the first alarm item can be regarded as the first item with that alarm event detected within the judgment waiting period. The first alarm item can then be ignored and no alarm information is sent for it, which effectively avoids false alarms. At the same time, the alarm state of the first alarm item can be marked as the pre-recovery state, so that when an alarm item with the same alarm event is detected again, the first alarm item can be identified from its alarm state as an unprocessed alarm item. If the first alarm item is successfully recovered within the judgment waiting period following the moment it was detected, its alarm state can be marked as the recovery state, so that when an alarm item with the same alarm event is detected again, the first alarm item can be identified from its alarm state as not being an unprocessed alarm item.
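The look-back check can be sketched as a filter over previously recorded alarm items. This is a hedged illustration: the dictionary keys (`event`, `detected_at`, `state`) and the function name are hypothetical, and treating both `pre_recovery` and `problem` as "unprocessed" is an assumption, since the text above only states this explicitly for the pre-recovery state.

```python
from datetime import datetime, timedelta

# Assumption: items in these states count as "not yet processed".
UNPROCESSED_STATES = {"pre_recovery", "problem"}


def find_unprocessed_same_event(history, event, detected_at, judgment_wait):
    """Return earlier unprocessed items with the same alarm event.

    Only items detected inside the judgment waiting period that ends at
    `detected_at` are considered.
    """
    window_start = detected_at - judgment_wait
    return [
        item
        for item in history
        if item["event"] == event
        and window_start <= item["detected_at"] < detected_at
        and item["state"] in UNPROCESSED_STATES
    ]


now = datetime(2022, 1, 18, 12, 0, 0)
history = [
    {"event": "disk_low", "detected_at": now - timedelta(minutes=2), "state": "pre_recovery"},
    {"event": "disk_low", "detected_at": now - timedelta(minutes=30), "state": "pre_recovery"},
    {"event": "disk_low", "detected_at": now - timedelta(minutes=1), "state": "recovery"},
]
matches = find_unprocessed_same_event(history, "disk_low", now, timedelta(minutes=5))
```

An empty result corresponds to the case in which the first alarm item is ignored and marked as pre-recovery.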

In an embodiment of the present invention, when there is an unprocessed alarm item with the same alarm event as the first alarm item, the method may further include the following step:

determining, according to the number of unprocessed alarm items with the same alarm event as the first alarm item, whether to send alarm information for the first alarm item.

When a first alarm item on the management side is detected, and no second alarm item on the user side that is associated with the first alarm item currently exists, the alarm event corresponding to the first alarm item can be considered not to have affected use on the user side. It can then be further determined whether, within the judgment waiting period preceding the moment the first alarm item was detected, there is an unprocessed alarm item with the same alarm event as the first alarm item.

If such an unprocessed alarm item exists within the judgment waiting period preceding the moment the first alarm item was detected, it can be concluded that alarm items with the same alarm event were generated before the first alarm item and have not yet been processed. In that case, the number of unprocessed alarm items with the same alarm event as the first alarm item can be determined, and that number can be used to decide whether to send alarm information for the first alarm item.

Specifically, if the number of unprocessed alarm items with the same alarm event as the first alarm item falls within a set first number range, the first alarm item is ignored; if that number falls within a set second number range, it is determined to send alarm information for the first alarm item. The upper limit of the first number range is smaller than the lower limit of the second number range.

In this embodiment of the present invention, the first number range and the second number range can be preset, with the upper limit of the first number range smaller than the lower limit of the second number range. For example, the first number range may be set to (0,1] and the second number range to [2,5].

If the number of unprocessed alarm items with the same alarm event as the first alarm item falls within the set first number range, the alarm event corresponding to the first alarm item can be considered to be possibly recovering on its own. The first alarm item can be ignored and no alarm information is sent for it, which effectively avoids false alarms.

If the number of unprocessed alarm items with the same alarm event as the first alarm item falls within the set second number range, the alarm event corresponding to the first alarm item can be considered relatively urgent and possibly unable to recover on its own. It can be determined to send alarm information for the first alarm item, so that troubleshooting can start promptly on the basis of that information. At the same time, the alarm state of the first alarm item can be marked as the fault state and, once handling is complete, marked as the processing-complete state.
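The count-based rule can be sketched with the example ranges (0,1] and [2,5] given above. The function name and return values below are hypothetical. A count of zero is mapped to "ignore" in line with the earlier paragraph on the case where no unprocessed item exists; the text does not say what happens when the count lies outside both ranges, so the sketch returns `None` there instead of guessing.

```python
def decide_by_unprocessed_count(count, first_range=(0, 1), second_range=(2, 5)):
    """Map the number of unprocessed same-event items to a decision.

    first_range is half-open (low, high]; second_range is closed [low, high].
    Returns "ignore", "send", or None when the text gives no rule.
    """
    if count == 0:
        return "ignore"  # no unprocessed same-event item exists
    low1, high1 = first_range
    if low1 < count <= high1:
        return "ignore"  # the event may be recovering on its own
    low2, high2 = second_range
    if low2 <= count <= high2:
        return "send"  # the event looks urgent and unlikely to self-heal
    return None  # outside both ranges: not specified in the text
```

Keeping the ranges as parameters reflects the statement that they are preset values, with the first range's upper limit below the second range's lower limit.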

In an embodiment of the present invention, when a first alarm item on the management side is detected, and before it is determined whether a second alarm item on the user side that is associated with the first alarm item currently exists, the method may further include the following steps:

First step: determining the alarm type of the first alarm item;

Second step: if the alarm type of the first alarm item is a set low-sensitivity type, determining whether the alarm event corresponding to the first alarm item has been successfully recovered by the time the allowed self-healing waiting period of the first alarm item is reached;

Third step: if the alarm event corresponding to the first alarm item has not been successfully recovered, performing the step of determining whether a second alarm item on the user side that is associated with the first alarm item currently exists.

For ease of description, the three steps above are explained together.

Management-side alarm items can have several alarm types, such as a high-sensitivity type and a low-sensitivity type. For high-sensitivity alarm items, alarm information must be sent promptly; examples are the alarm items for an abnormal application program interface (API) server process, an abnormal controller manager process, or a DNS server/forwarder that fails to start. Low-sensitivity alarm items can tolerate a waiting period; examples are the alarm items for remaining free disk space falling below a set threshold, a machine restarting, or an excessive backlog of container sets.

The allowed self-healing waiting period can be set to 0 for high-sensitivity alarm items and can be set according to the alarm level for low-sensitivity alarm items.

In this embodiment of the present invention, when a first alarm item on the management side is detected, the alarm type of the first alarm item may be determined first.

If the alarm type of the first alarm item is the set low-sensitivity type, it can be determined whether the alarm event corresponding to the first alarm item has been successfully recovered by the time the allowed self-healing waiting period of the first alarm item is reached. If the alarm event has not been successfully recovered within that period, it can be considered unable to heal on its own, and the step of determining whether a second alarm item on the user side that is associated with the first alarm item currently exists, together with the subsequent steps, can be performed. That is, the user-side alarm items are taken into account in deciding whether to send alarm information for the first alarm item, so as to lower the false-alarm rate.

If the alarm type of the first alarm item is the set high-sensitivity type, the step of determining whether a second alarm item on the user side that is associated with the first alarm item currently exists, together with the subsequent steps, can be performed directly. That is, the user-side alarm items are taken into account in deciding whether to send alarm information for the first alarm item, so that the decision is made promptly, which improves alarm efficiency while lowering the false-alarm rate.
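The type-based pre-check can be sketched as a gate in front of the user-side association check. The sketch is illustrative only: `precheck`, the `"high"`/`"low"` labels, the `is_recovered` callback, and the polling interval are assumptions, and the clock and sleep functions are injectable purely so the logic can be exercised without real waiting.

```python
import time


def precheck(
    alarm_type,
    self_heal_wait,
    is_recovered,
    poll_interval=1.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Return True when the flow should continue to the user-side check.

    High-sensitivity items continue immediately. Low-sensitivity items are
    polled until the allowed self-healing waiting period (seconds) expires;
    they continue only if the event has not recovered by then.
    """
    if alarm_type == "high":
        return True
    deadline = clock() + self_heal_wait
    while clock() < deadline:
        if is_recovered():
            return False  # self-healed: no further judgment needed
        sleep(poll_interval)
    return not is_recovered()
```

A caller would pass a function that queries the actual recovery status of the alarm event as `is_recovered`.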

In an embodiment of the present invention, when the alarm type of the first alarm item is the set low-sensitivity type, the method may further include the following steps:

First step: marking the alarm state of the first alarm item as the pre-recovery state;

Second step: if the alarm event corresponding to the first alarm item is successfully recovered within the allowed self-healing waiting period of the first alarm item, updating the alarm state of the first alarm item from the pre-recovery state to the recovery state;

Third step: if the alarm event corresponding to the first alarm item is not successfully recovered within the allowed self-healing waiting period of the first alarm item, updating the alarm state of the first alarm item from the pre-recovery state to the fault state and, if it is detected after alarm information for the first alarm item has been sent that the corresponding alarm event has been successfully handled, updating the alarm state of the first alarm item from the fault state to the processing-complete state.

For ease of description, the three steps above are explained together.

In this embodiment of the present invention, when a first alarm item on the management side is detected and its alarm type is determined to be the low-sensitivity type, the alarm state of the first alarm item can be marked as the pre-recovery state. Timing then starts, and the status is polled at a set interval to determine whether the alarm event of the first alarm item is successfully recovered within the allowed self-healing waiting period of the first alarm item.

If the alarm event corresponding to the first alarm item is successfully recovered within the allowed self-healing waiting period of the first alarm item, the alarm state of the first alarm item can be updated from the pre-recovery state to the recovery state. The alarm state of an alarm item can then be used to decide whether any further operation is needed, such as deciding, in combination with the user-side alarm items, whether to send alarm information.

If the alarm event corresponding to the first alarm item is not successfully recovered within the allowed self-healing waiting period of the first alarm item, the alarm event can be considered unable to heal on its own. The alarm state of the first alarm item can be updated from the pre-recovery state to the fault state, and whether to send alarm information is decided in combination with the user-side alarm items. After alarm information for the first alarm item has been sent, if it is detected that the corresponding alarm event has been successfully handled, the alarm state of the first alarm item can be updated from the fault state to the processing-complete state.

In other words, the alarm state in this embodiment of the present invention takes multiple values and may include the pre-recovery state, the recovery state, the fault state, and the processing-complete state, for example (pre_recovery, recovery, problem, OK). No alarm information is sent for alarm items in the pre-recovery state pre_recovery or the recovery state recovery; such items can be recorded in a log. For alarm items in the fault state problem, alarm information can be sent after further confirmation. Once handling is complete, the alarm state of the alarm item is OK.

That is, the alarm state of a high-sensitivity alarm item may be updated as follows: problem -> (successfully handled) -> OK;

The alarm state of a low-sensitivity alarm item may be updated in one of two ways: pre_recovery -> (successfully recovered within the allowed self-healing waiting period) -> recovery; or pre_recovery -> (not successfully recovered within the allowed self-healing waiting period) -> problem -> (successfully handled) -> OK.
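The state sequences above amount to a small state machine. The following sketch encodes them as a transition table; the state values come from the text (pre_recovery, recovery, problem, OK), while the `AlarmState` enum, the `ALLOWED_TRANSITIONS` table, and the helper names are hypothetical.

```python
from enum import Enum


class AlarmState(Enum):
    PRE_RECOVERY = "pre_recovery"
    RECOVERY = "recovery"
    PROBLEM = "problem"
    OK = "OK"


# Transitions named in the text; recovery and OK are treated as terminal.
ALLOWED_TRANSITIONS = {
    AlarmState.PRE_RECOVERY: {AlarmState.RECOVERY, AlarmState.PROBLEM},
    AlarmState.PROBLEM: {AlarmState.OK},
    AlarmState.RECOVERY: set(),
    AlarmState.OK: set(),
}


def initial_state(alarm_type: str) -> AlarmState:
    """High-sensitivity items start in problem, low-sensitivity in pre_recovery."""
    return AlarmState.PROBLEM if alarm_type == "high" else AlarmState.PRE_RECOVERY


def transition(current: AlarmState, target: AlarmState) -> AlarmState:
    """Return the new state, rejecting transitions the text does not describe."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = initial_state("low")
state = transition(state, AlarmState.PROBLEM)
state = transition(state, AlarmState.OK)
```

Only items in the `PROBLEM` state are candidates for alarm information, matching the statement that pre_recovery and recovery items are merely logged.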

Handling the alarm state of an alarm item as a multi-valued state helps to decide accurately whether alarm information should be sent for that alarm item.

For ease of understanding, the embodiment of the present invention is described once more, taking the specific implementation process shown in FIG. 3 as an example.

When an alarm item occurs on the management side, the alarm type of that alarm item is determined first.

If it is the high-sensitivity type, the alarm data is formatted directly and passed to the alarm control process.

If it is the low-sensitivity type, pre-recovery processing is performed: the alarm state is marked as the pre-recovery state, and it is determined whether the alarm event has been successfully recovered. If it has, the alarm state can be marked as the recovery state and the process ends. If it has not, it can be determined whether the timeout, that is, the allowed self-healing waiting period, has been reached. If the timeout has not been reached, the recovery check is repeated. If the timeout has been reached, the alarm data can be formatted and passed to the alarm control process.

When an alarm item occurs on the user side, the alarm data is formatted directly and passed to the alarm control process.

In the alarm control process, it is determined whether a user-side alarm item that is associated with the management-side alarm item currently exists. If it does, whether to send alarm information is decided according to the relationship between the alarm data of the management-side alarm item and the alarm data of the user-side alarm item, and when it is determined to send alarm information, the information is sent by voice, e-mail, text, or similar means.

By combining management-side alarm items with user-side alarm items to decide whether to send alarm information, the embodiment of the present invention can filter out alarm items that are able to heal on their own, lower the false-alarm rate of alarm information, and improve the stability of the cluster and its services.

Corresponding to the method embodiments above, an embodiment of the present invention further provides a cluster alarm control device. The cluster alarm control device described below and the cluster alarm control method described above may be read with reference to each other.

Referring to FIG. 4, the device may include the following modules:

an associated-alarm-item existence judging module 410, configured to determine, when a first alarm item on the management side is detected while the cluster is operating, whether a second alarm item on the user side that is associated with the first alarm item currently exists;

an alarm-information sending judging module 420, configured to decide, when the second alarm item currently exists, whether to send alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item.

With the device provided by the embodiments of the present invention, when a first alarm item on the management side is detected, alarm information for the first alarm item is not sent directly. Instead, it is first determined whether a second alarm item on the user side that is associated with the first alarm item currently exists. If it does, the alarm event corresponding to the first alarm item on the management side is considered to have possibly affected use on the user side, and whether to send alarm information for the first alarm item is decided according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item. In other words, the decision to send alarm information combines management-side and user-side alarm items, which improves the accuracy of the alarm information, lowers its false-alarm rate, and avoids unnecessary expenditure of manpower and material resources.

In a specific implementation of the present invention, the alarm-information sending judging module 420 is configured to:

determine, according to the alarm data of the first alarm item and the alarm data of the second alarm item, the relative magnitude of the alarm level of the first alarm item and the alarm level of the second alarm item;

if the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, determine to send alarm information for the first alarm item.

In a specific embodiment of the present invention, the alarm information issuance judging module 420 is further configured to:

in the case where the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, and before determining to issue alarm information for the first alarm item, determine, according to the alarm data of the first alarm item and the alarm data of the second alarm item, whether the management-side associated alarm item set of the second alarm item contains the first alarm item, where the management-side associated alarm item set of the second alarm item comprises preset management-side alarm items that have an association relationship with the second alarm item; and

if the management-side associated alarm item set of the second alarm item contains the first alarm item, perform the step of determining to issue alarm information for the first alarm item.
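
The two checks performed by module 420 (level comparison, then membership in the preset associated set) can be sketched as follows. This is only an illustrative sketch: the `AlarmItem` class, the `ASSOCIATED_MANAGEMENT_ITEMS` table, the alarm names, and the convention that a larger number means a more severe level are all assumptions, not part of the patent text. The text also does not say what happens when the user-side level is lower, so returning `False` there is an assumption as well.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping


@dataclass(frozen=True)
class AlarmItem:
    """Hypothetical alarm record; the field names are illustrative."""
    name: str
    level: int  # assumption: a larger number means a more severe alarm


# Hypothetical preset table: for each user-side alarm item, the set of
# management-side alarm items configured as associated with it.
ASSOCIATED_MANAGEMENT_ITEMS: Mapping[str, FrozenSet[str]] = {
    "user_job_submit_failed": frozenset({"scheduler_down", "node_disk_full"}),
}


def should_issue_alarm(
    first: AlarmItem,
    second: AlarmItem,
    associations: Mapping[str, FrozenSet[str]] = ASSOCIATED_MANAGEMENT_ITEMS,
) -> bool:
    """Return True only when both checks described in the text pass."""
    # Check 1: the user-side level must be >= the management-side level.
    if second.level < first.level:
        return False  # assumption: the text leaves this branch unspecified
    # Check 2: the first (management-side) item must belong to the preset
    # management-side associated set of the second (user-side) item.
    return first.name in associations.get(second.name, frozenset())
```

Keeping the association table as data rather than code reflects the statement that the associated set is preset; in practice it would presumably be loaded from configuration.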

In a specific embodiment of the present invention, the device further includes:

an identical alarm item judging module, configured to, when no second alarm item currently exists, determine whether, within the judgment waiting period preceding the moment the first alarm item was detected, there is any unresolved alarm item (one whose handling has not been completed) with the same alarm event as the first alarm item; and

an alarm item processing module, configured to ignore the first alarm item when no unresolved alarm item with the same alarm event as the first alarm item exists.

In a specific embodiment of the present invention, the alarm information issuance judging module 420 is further configured to:

when there are unresolved alarm items with the same alarm event as the first alarm item, determine whether to issue alarm information for the first alarm item according to the number of those unresolved alarm items.

In a specific embodiment of the present invention, the alarm information issuance judging module 420 is configured to:

ignore the first alarm item when the number of unresolved alarm items with the same alarm event as the first alarm item is within a set first number range; and

determine to issue alarm information for the first alarm item when the number of unresolved alarm items with the same alarm event as the first alarm item is within a set second number range;

where the upper limit of the first number range is less than the lower limit of the second number range.
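
The count-based decision used when no user-side alarm item exists might look like the sketch below. The `PastAlarm` record, the inclusive `(low, high)` range tuples, and the string results are assumptions made for illustration. The text does not say what happens when the count falls in neither range, so the sketch returns `"unspecified"` there rather than guessing.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PastAlarm:
    """Hypothetical record of an earlier alarm item."""
    event: str          # identifier of the alarm event
    detected_at: float  # detection time, in seconds
    handled: bool       # True once handling has been completed


def decide_without_user_alarm(
    event: str,
    detected_at: float,
    history: Iterable[PastAlarm],
    wait_seconds: float,
    first_range: Tuple[int, int],
    second_range: Tuple[int, int],
) -> str:
    """Return "ignore", "issue", or "unspecified" for the first alarm item."""
    # The text requires the first range to lie strictly below the second.
    if first_range[1] >= second_range[0]:
        raise ValueError("first range must end below the start of the second")
    window_start = detected_at - wait_seconds
    # Count unresolved items with the same event inside the waiting window.
    count = sum(
        1
        for past in history
        if past.event == event
        and not past.handled
        and window_start <= past.detected_at < detected_at
    )
    if count == 0:
        return "ignore"  # no unresolved identical item exists
    if first_range[0] <= count <= first_range[1]:
        return "ignore"
    if second_range[0] <= count <= second_range[1]:
        return "issue"
    return "unspecified"  # the text does not cover counts outside both ranges
```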

In a specific embodiment of the present invention, the device further includes:

an alarm type determining module, configured to, when the first alarm item on the management side is detected and before determining whether a second alarm item on the user side associated with the first alarm item currently exists, determine the alarm type of the first alarm item; and

a recovery judging module, configured to, when the alarm type of the first alarm item is a set low-sensitivity type, determine whether the alarm event corresponding to the first alarm item has been successfully recovered by the time the allowed self-healing waiting period of the first alarm item is reached, and, if the alarm event corresponding to the first alarm item has not been successfully recovered, trigger the associated alarm item existence judging module 410 to perform the step of determining whether a second alarm item on the user side associated with the first alarm item currently exists.

In a specific embodiment of the present invention, the device further includes an alarm state marking module, configured to:

mark the alarm state of the first alarm item as a pre-recovery state when the alarm type of the first alarm item is the set low-sensitivity type;

update the alarm state of the first alarm item from the pre-recovery state to a recovered state if the alarm event corresponding to the first alarm item is successfully recovered within the allowed self-healing waiting period of the first alarm item; and

update the alarm state of the first alarm item from the pre-recovery state to a fault state if the alarm event corresponding to the first alarm item is not successfully recovered within the allowed self-healing waiting period of the first alarm item, and, when it is detected after alarm information for the first alarm item has been issued that the alarm event corresponding to the first alarm item has been successfully handled, update the alarm state of the first alarm item from the fault state to a processing-completed state.
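
The state transitions of a low-sensitivity alarm item can be modelled as a small state machine. The sketch below is an illustration under assumptions: the class, method, and state names are hypothetical, and the requirement that alarm information has already been issued before the final transition is only noted in a comment rather than enforced.

```python
from enum import Enum


class AlarmState(Enum):
    PRE_RECOVERY = "pre_recovery"  # waiting to see whether the event self-heals
    RECOVERED = "recovered"        # self-healed within the allowed waiting period
    FAULT = "fault"                # did not self-heal in time
    HANDLED = "handled"            # handling completed after the alarm was issued


class LowSensitivityAlarm:
    """Hypothetical tracker for one low-sensitivity alarm item."""

    def __init__(self, name: str) -> None:
        self.name = name
        # A low-sensitivity item starts in the pre-recovery state.
        self.state = AlarmState.PRE_RECOVERY

    def on_self_heal_deadline(self, recovered: bool) -> bool:
        """Apply the result of the self-healing wait.

        Returns True when the caller should go on to check for an
        associated user-side alarm item (the event did not recover).
        """
        if self.state is not AlarmState.PRE_RECOVERY:
            raise ValueError(f"unexpected state: {self.state}")
        self.state = AlarmState.RECOVERED if recovered else AlarmState.FAULT
        return not recovered

    def mark_handled(self) -> None:
        """Record that the alarm event was successfully handled.

        The text places this transition after alarm information has been
        issued; this sketch only checks that the item is in the fault state.
        """
        if self.state is not AlarmState.FAULT:
            raise ValueError(f"unexpected state: {self.state}")
        self.state = AlarmState.HANDLED
```

Rejecting transitions from an unexpected state is a design choice of the sketch, intended to make out-of-order updates visible instead of silently overwriting the state.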

An embodiment of the present invention further provides an electronic device which, as shown in FIG. 5, includes a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with one another through the communication bus 504;

the memory 503 is configured to store a computer program; and

the processor 501 is configured to implement the following steps when executing the program stored in the memory 503:

during operation of the cluster, when a first alarm item on the management side is detected, determining whether a second alarm item on the user side that is associated with the first alarm item currently exists; and

if the second alarm item currently exists, determining whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item.

The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the above terminal and other devices.

The memory may include random access memory (RAM), and may also include non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment provided by the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to perform the cluster alarm control method of any one of the above embodiments.

In yet another embodiment provided by the present invention, a computer program product containing instructions is further provided which, when run on a computer, causes the computer to perform the cluster alarm control method of any one of the above embodiments.

In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, wireless, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner; for parts that are the same or similar across embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted.

The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (11)

1. A cluster alarm control method, characterized by comprising:
during operation of a cluster, when a first alarm item on a management side is detected, determining whether a second alarm item on a user side that has an association relationship with the first alarm item currently exists; and
if the second alarm item currently exists, determining whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item.
2. The method according to claim 1, wherein the determining whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item comprises:
determining the magnitude relationship between the alarm level of the first alarm item and the alarm level of the second alarm item according to the alarm data of the first alarm item and the alarm data of the second alarm item; and
if the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, determining to issue alarm information for the first alarm item.
3. The method according to claim 2, wherein, in the case where the alarm level of the second alarm item is greater than or equal to the alarm level of the first alarm item, before the determining to issue alarm information for the first alarm item, the method further comprises:
determining, according to the alarm data of the first alarm item and the alarm data of the second alarm item, whether a management-side associated alarm item set of the second alarm item contains the first alarm item, wherein the management-side associated alarm item set of the second alarm item comprises preset management-side alarm items having an association relationship with the second alarm item; and
if the management-side associated alarm item set of the second alarm item contains the first alarm item, performing the step of determining to issue alarm information for the first alarm item.
4. The method according to claim 1, wherein, in the case where the second alarm item does not currently exist, the method further comprises:
determining whether, within a judgment waiting period preceding the moment the first alarm item was detected, there is an unresolved alarm item with the same alarm event as the first alarm item; and
if no unresolved alarm item with the same alarm event as the first alarm item exists, ignoring the first alarm item.
5. The method according to claim 4, wherein, in the case where there are unresolved alarm items with the same alarm event as the first alarm item, the method further comprises:
determining whether to issue alarm information for the first alarm item according to the number of unresolved alarm items with the same alarm event as the first alarm item.
6. The method according to claim 5, wherein the determining whether to issue alarm information for the first alarm item according to the number of unresolved alarm items with the same alarm event as the first alarm item comprises:
if the number of unresolved alarm items with the same alarm event as the first alarm item is within a set first number range, ignoring the first alarm item; and
if the number of unresolved alarm items with the same alarm event as the first alarm item is within a set second number range, determining to issue alarm information for the first alarm item;
wherein the upper limit of the first number range is less than the lower limit of the second number range.
7. The method according to any one of claims 1 to 6, wherein, in the case where the first alarm item on the management side is detected, before the determining whether a second alarm item on the user side that has an association relationship with the first alarm item currently exists, the method further comprises:
determining an alarm type of the first alarm item;
if the alarm type of the first alarm item is a set low-sensitivity type, determining whether the alarm event corresponding to the first alarm item has been successfully recovered when an allowed self-healing waiting period of the first alarm item is reached; and
if the alarm event corresponding to the first alarm item has not been successfully recovered, performing the step of determining whether a second alarm item on the user side that has an association relationship with the first alarm item currently exists.
8. The method according to claim 7, wherein, in the case where the alarm type of the first alarm item is the set low-sensitivity type, the method further comprises:
marking the alarm state of the first alarm item as a pre-recovery state;
if the alarm event corresponding to the first alarm item is successfully recovered within the allowed self-healing waiting period of the first alarm item, updating the alarm state of the first alarm item from the pre-recovery state to a recovered state; and
if the alarm event corresponding to the first alarm item is not successfully recovered within the allowed self-healing waiting period of the first alarm item, updating the alarm state of the first alarm item from the pre-recovery state to a fault state, and, when it is detected after the alarm information for the first alarm item has been issued that the alarm event corresponding to the first alarm item has been successfully handled, updating the alarm state of the first alarm item from the fault state to a processing-completed state.
9. A cluster alarm control device, comprising:
an associated alarm item existence judging module, configured to determine, during operation of a cluster and when a first alarm item on a management side is detected, whether a second alarm item on a user side that has an association relationship with the first alarm item currently exists; and
an alarm information issuance judging module, configured to determine, when the second alarm item currently exists, whether to issue alarm information for the first alarm item according to the relationship between the alarm data of the first alarm item and the alarm data of the second alarm item.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program; and
the processor is configured to implement the steps of the cluster alarm control method of any one of claims 1-8 when executing the program stored in the memory.
11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the cluster alarm control method according to any one of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055230.2A CN114461506A (en) 2022-01-18 2022-01-18 Cluster alarm control method, device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114461506A true CN114461506A (en) 2022-05-10

Family

ID=81409153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055230.2A Pending CN114461506A (en) 2022-01-18 2022-01-18 Cluster alarm control method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114461506A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102820995A (en) * 2008-11-18 2012-12-12 华为技术有限公司 Alarm processing method, device and system
WO2019001312A1 (en) * 2017-06-28 2019-01-03 华为技术有限公司 Method and apparatus for realizing alarm association, and computer readable storage medium
CN109829833A (en) * 2018-12-07 2019-05-31 国网浙江省电力有限公司 Power distribution network alarm method and system, electronic equipment and computer readable storage medium
WO2020238810A1 (en) * 2019-05-25 2020-12-03 华为技术有限公司 Alarm analysis method and related device
CN110650036A (en) * 2019-08-30 2020-01-03 中国人民财产保险股份有限公司 Alarm processing method and device and electronic equipment
CN111555899A (en) * 2020-02-18 2020-08-18 远景智能国际私人投资有限公司 Alarm rule configuration method, equipment state monitoring method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于志华: ""面向CPE设备的综合终端管理系统的设计与实现"", 《中国优秀硕士学位论文全文数据库(电子期刊)》, no. 2014, 15 March 2014 (2014-03-15) *

Similar Documents

Publication Publication Date Title
CN110224858B (en) Log-based alarm method and related device
CN106713007A (en) Alarm monitoring system and alarm monitoring method and device for server
CN108038039B (en) Method for recording log and micro-service system
CN113495820A (en) Method and device for collecting and processing abnormal information and abnormal monitoring system
CN110955581A (en) Online software abnormity warning method and device, electronic equipment and storage medium
WO2021174684A1 (en) Cutover information processing method, system and apparatus
CN110362435B (en) PCIE fault positioning method, device, equipment and medium for Purley platform server
CN106095638A (en) The method of a kind of server resource alarm, Apparatus and system
CN112068935A (en) Method, device and equipment for monitoring deployment of kubernets program
CN111367934A (en) Data consistency checking method, device, server and medium
CN111327466A (en) An alarm analysis method, system, device and medium
CN107612755A (en) The management method and its device of a kind of cloud resource
CN118626345A (en) Method, device, storage medium and electronic device for service abnormality alarm and positioning
US20230359514A1 (en) Operation-based event suppression
CN114461506A (en) Cluster alarm control method, device, electronic device and storage medium
CN116483663A (en) Abnormality warning method and device for platform
CN115296979B (en) Fault processing method, device, equipment and storage medium
CN108390770B (en) Information generation method and device and server
CN115102838B (en) Emergency processing method and device for server downtime risk and electronic equipment
CN115733740A (en) Log detection method and device, computer equipment and computer readable storage medium
CN115941443A (en) A message queue-based service exception alarm method and system
CN115580522A (en) Method and device for monitoring running state of container cloud platform
CN114780378A (en) System stability detection traceability method and related equipment based on business interface
WO2020147415A1 (en) Snapshot service process management method and apparatus, electronic device, and readable storage medium
CN112069027A (en) Interface data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination