CN117290133A

CN117290133A - Abnormal event processing method, electronic device and storage medium

Info

Publication number: CN117290133A
Application number: CN202210678899.7A
Authority: CN
Inventors: 姜磊; 罗秋野; 文秀林; 孟照星
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2022-06-16
Filing date: 2022-06-16
Publication date: 2023-12-26
Also published as: WO2023241484A1

Abstract

Embodiments of the invention disclose an abnormal event processing method, an electronic device and a storage medium. The method comprises: acquiring a plurality of abnormal events of a target location within a preset time period, the abnormal events comprising at least one of alarms, key performance indicator anomalies and operation logs; determining an aggregation point among the abnormal events; and performing aggregation according to the aggregation point and the abnormal events to obtain an aggregation result. The aggregation result can be used for root cause analysis. Because the abnormal events are acquired over a period of time, events that occurred before the time node of the aggregation point can also be aggregated, based on the time node and location of the aggregation point. Aggregation is therefore bidirectional: other events that may be the root cause of a fault are aggregated as well, which improves the aggregation capability for data sources and the level of fault operation and maintenance.

Description

Abnormal event handling methods, electronic equipment and storage media

Technical field

The present invention relates to, but is not limited to, the field of communication technology, and in particular to an abnormal event processing method, an electronic device and a storage medium.

Background

With the development of mobile communication technology, growing network complexity, application diversity and data volumes place ever-increasing demands on intelligent operation and maintenance. Aggregating related streaming data and then analyzing it is the main means of identifying the root cause of a fault. In the related art, however, data sources can usually be aggregated only after an alarm has occurred, working toward later events from that alarm. If the alarm was itself triggered by an earlier operation, such aggregation cannot identify the root cause of the fault. The aggregation capability is therefore low, which leads to a low level of fault operation and maintenance.

Summary of the invention

Embodiments of the present invention provide an abnormal event processing method, an electronic device and a storage medium that realize bidirectional aggregation, which improves the aggregation capability for data sources and the level of fault operation and maintenance.

In a first aspect, an embodiment of the present invention provides an abnormal event processing method, the method comprising: acquiring a plurality of abnormal events of a target location within a preset time period, the abnormal events comprising at least one of alarms, key performance indicator anomalies and operation logs; determining an aggregation point among the abnormal events; and performing aggregation according to the aggregation point and the abnormal events to obtain an aggregation result.

In a second aspect, an embodiment of the present invention provides an electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the abnormal event processing method according to any one of the embodiments of the first aspect of the present invention.

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements the abnormal event processing method according to any one of the embodiments of the first aspect of the present invention.

Embodiments of the present invention provide at least the following beneficial effects. By executing the abnormal event processing method, a plurality of abnormal events of a target location can be acquired continuously within a preset time period. The target location is, spatially, a link, a network element or an equipment room, and the abnormal events include at least one of alarms, key performance indicator anomalies and operation logs, so that multiple data sources are acquired. An aggregation point is then determined among the abnormal events; the aggregation point may be any designated abnormal event. Aggregation is performed according to the aggregation point and the abnormal events to obtain an aggregation result for root cause analysis. Because the abnormal events are acquired over a period of time, the required abnormal events preceding the time node of the aggregation point can also be aggregated, based on the time node and location of the aggregation point. Aggregation can therefore proceed not only toward later events but also toward earlier ones, gathering other events that may be the root cause of the fault. This bidirectional aggregation improves the aggregation capability for data sources and the level of fault operation and maintenance.

Description of the drawings

Figure 1 is a schematic flowchart of an abnormal event processing method provided by an embodiment of the present invention;

Figure 2 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 3 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 4 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 5 is a schematic diagram of a target cache area provided by an embodiment of the present invention;

Figure 6 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 7 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 8 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 9 is a schematic diagram of bidirectional aggregation, toward both earlier and later events, with a communication anomaly as the aggregation point, provided by an embodiment of the present invention;

Figure 10 is a schematic diagram of bidirectional aggregation, toward both earlier and later events, with a network unreachable alarm as the aggregation point, provided by an embodiment of the present invention;

Figure 11 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 12 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 13 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 14 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 15 is a schematic flowchart of an abnormal event processing method provided by another embodiment of the present invention;

Figure 16 is a schematic diagram of an electronic device provided by an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

In the description of the present invention, it should be understood that any orientation or positional relationship indicated by terms such as up, down, front, back, left and right is based on the orientation or positional relationship shown in the drawings. Such terms are used only to facilitate and simplify the description; they do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the embodiments of the present invention.

It should be understood that, in the description of the embodiments of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including it. Terms such as "first" and "second" are used only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their order.

In the description of the embodiments of the present invention, unless otherwise expressly limited, words such as "arranged", "installed" and "connected" are to be understood broadly, and those skilled in the art can reasonably determine their specific meaning in the embodiments of the present invention from the specific content of the technical solution.

With the continued rollout of 5G new infrastructure, networks are becoming more complex, applications more diverse and data volumes far larger. Operators and equipment vendors have growing demands for automation and intelligence across the "planning, construction, maintenance, optimization and operation" aspects of the autonomous network (Autonomous Network). Here "maintenance" refers to operation and maintenance, which centers on fault handling. Fault localization has evolved from aggregation analysis of a single alarm data source to aggregation analysis of multiple data sources, such as logs, key performance indicators (Key Performance Indicator, KPI) and alarms.

Aggregation analysis means aggregating related streaming data and then analyzing it to identify the root cause of a fault. Aggregation has two dimensions, time and space, and is therefore also called spatio-temporal aggregation. Aggregation in the spatial dimension can exploit the correlation of topological resources, for example related data from the same network element, the same link or the same equipment room. Aggregation in the time dimension means aggregating related data within a certain time range. Unlike spatial aggregation, temporal aggregation is relatively difficult, mainly because the time range is hard to determine.

For example, within one spatial dimension and a certain time window, alarm A leads to alarm B. Operators and equipment vendors have derived rules of this kind from experience based on statistics over historical data. Three rules are given below as examples; the basic format of a rule can be:

1) First rule: same network element, 5-minute window. If a remote radio unit (Remote Radio Unit, RRU) link high bit error rate alarm and an optical module received optical power abnormality occur within the window, the two alarms are considered aggregatable;

2) Second rule: same network element, 10-minute window. If an optical module received optical power abnormality and an RRU link down alarm occur within the window, the two alarms are considered aggregatable;

3) Third rule: same network element, 15-minute window. If an RRU link down alarm and a distributed unit (Distributed Unit, DU) cell out-of-service alarm occur within the window, the two alarms are considered aggregatable.

Under these rules, the RRU link high bit error rate alarm, the optical module received optical power abnormality alarm, the RRU link down alarm and the DU cell out-of-service alarm need to be aggregated together in the spatio-temporal dimension. The root cause is then found to be that the high bit error rate on the RRU link caused the DU cell to go out of service.
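The chaining of the three example rules can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the `Rule` type, the alarm identifiers, the cause-to-effect direction assigned to each rule, and the choice to compare occurrence times pairwise against each rule's window are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    cause: str        # assumed upstream alarm
    effect: str       # assumed downstream alarm
    window_min: int   # rule window in minutes

# Hypothetical identifiers for the four alarms named in the text.
RULES = [
    Rule("rru_ber_high", "optical_rx_power_abnormal", 5),
    Rule("optical_rx_power_abnormal", "rru_link_down", 10),
    Rule("rru_link_down", "du_cell_out_of_service", 15),
]

def root_causes(alarm_times, rules=RULES):
    """Return alarms not explained by another alarm under any rule.

    alarm_times maps alarm name -> occurrence time in minutes for one
    network element.
    """
    effects = {
        r.effect
        for r in rules
        if r.cause in alarm_times
        and r.effect in alarm_times
        and abs(alarm_times[r.effect] - alarm_times[r.cause]) <= r.window_min
    }
    return sorted(set(alarm_times) - effects)

alarms = {
    "rru_ber_high": 0,
    "optical_rx_power_abnormal": 3,
    "rru_link_down": 9,
    "du_cell_out_of_service": 20,
}
print(root_causes(alarms))
```

Note that the last alarm occurs 20 minutes after the first, beyond any single rule's window, yet it is still linked to the first alarm through the chain. This illustrates why a single fixed window is hard to choose.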

The applicant found that, in the related art, the time range in the time dimension is hard to determine and calls for a time-step design: it cannot simply be set to the largest window among the rules (15 minutes), nor to the sum of the windows of all related rules. Moreover, there is a key dependency: the time dimension relies entirely on looking toward later events. That is, if alarm A causes alarm B, alarm A occurs before alarm B, so the time of alarm B is predicted after alarm A has occurred.

The applicant found that aggregation of this kind has many problems. Alarm occurrence times may deviate, and alarm B may even be recorded before alarm A. It also cannot handle multiple data sources. If only alarms and key performance indicator anomalies are involved, the moment of the anomaly is relatively easy to pin down because there is a definite anomalous data source: when an alarm causes a key performance indicator to degrade or become abnormal, the alarm occurs first and aggregation proceeds from the alarm toward later events. But if an operation caused the alarm, the log of that operation precedes the alarm and cannot be reached by aggregating toward later events. Because a log does not readily indicate an anomaly, it cannot be detected in real time; typically the related logs are searched for only after an alarm or key performance indicator anomaly has occurred. With faults such as memory leaks, for example, the leak or a leak trend is discovered first and the related logs are then searched retrospectively. This is after-the-fact aggregation.

The solutions in the related art therefore have a technical defect: if an alarm was triggered by an earlier operation, such aggregation cannot identify the root cause of the fault. The aggregation capability is low, which leads to a low level of fault operation and maintenance.

On this basis, embodiments of the present invention provide an abnormal event processing method, an electronic device and a storage medium that realize bidirectional aggregation, improving the aggregation capability for data sources and the level of fault operation and maintenance.

A detailed description follows.

An embodiment of the present invention provides an abnormal event processing method. Referring to Figure 1, the abnormal event processing method in this embodiment includes, but is not limited to, steps S101 to S103.

Step S101: acquire a plurality of abnormal events of a target location within a preset time period, the abnormal events including at least one of alarms, key performance indicator anomalies and operation logs.

Step S102: determine an aggregation point among the abnormal events.

Step S103: perform aggregation according to the aggregation point and the abnormal events to obtain an aggregation result.

In an embodiment, the abnormal event processing method can be applied in a communication device. Executing the method realizes bidirectional aggregation, improving the aggregation capability for data sources and the level of fault operation and maintenance. Specifically, a plurality of abnormal events of the target location are acquired within a preset time period, and an aggregation point is determined among the acquired abnormal events. The abnormal events include at least one of alarms, key performance indicator (KPI) anomalies and operation logs. In one embodiment, the abnormal events include either alarms or key performance indicator anomalies, together with operation logs; in another embodiment, the abnormal events include alarms, key performance indicator anomalies and operation logs. The embodiments here are described using the case that includes all three. The aggregation point is a point set according to aggregation needs, and its specific type can be specified through configuration.

Aggregation is performed according to the aggregation point and the abnormal events to obtain an aggregation result. The aggregation point is one or more of the abnormal events. Because the abnormal events are acquired continuously within the preset time period, the time of the aggregation point lies inside the preset time period, so an aggregation centered on the aggregation point can contain multiple abnormal events both before and after the acquisition time of the aggregation point. Those earlier and later events may be at least one of alarms, key performance indicator anomalies and operation logs. Consequently, when an operation that took place before the aggregation point triggered an alarm or key performance indicator anomaly, the data both before and after the aggregation point can be aggregated. The resulting aggregation result can be used to identify the root cause of the fault. This bidirectional aggregation improves the aggregation capability for data sources and the level of fault operation and maintenance.
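A minimal sketch of aggregation around an aggregation point is given below. It is not the patent's implementation: the `Event` fields, the use of separate look-back and look-ahead spans (`before_min`, `after_min`), the sample event names and the location identifiers are assumptions introduced for illustration. The source specifies only that events both before and after the aggregation point, at the same target location, can be aggregated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    minute: float    # acquisition time, minutes from start of the preset period
    kind: str        # "alarm", "kpi" or "log"
    location: str    # target location, e.g. a network element id (hypothetical)
    name: str

def aggregate(events, point, before_min, after_min):
    """Collect events at the point's location within [t - before, t + after]."""
    lo, hi = point.minute - before_min, point.minute + after_min
    picked = [
        e for e in events
        if e.location == point.location and lo <= e.minute <= hi
    ]
    return sorted(picked, key=lambda e: e.minute)

point = Event(6, "alarm", "ne-1", "network unreachable")
events = [
    Event(2, "log", "ne-1", "configure route"),
    Event(4, "log", "ne-1", "restart route"),
    point,
    Event(9, "kpi", "ne-1", "KPI anomaly"),
    Event(6, "alarm", "ne-2", "unrelated alarm"),
]
result = aggregate(events, point, before_min=5, after_min=5)
print([e.name for e in result])
```

The two operation logs precede the alarm chosen as aggregation point, so an approach that only looks at later events would miss them. The event at `ne-2` is excluded because it belongs to a different location.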

It should be noted that the preset time period can be set according to actual operation and maintenance needs; for example, it may be 20 minutes, 1 hour, 4 hours or longer. From the start time of the preset time period, acquisition of abnormal events as data sources begins, including at least one of alarms, key performance indicator anomalies and operation logs, so that multiple data sources are acquired. Because data sources are acquired within the preset time period, setting the length of that period fixes the aggregation time range in the time dimension.

It should be noted that the target location is a location in the spatial dimension; for example, it may be a network element, an equipment room or a link. Aggregation analysis of the final aggregation result yields the root cause of the fault for that network element, equipment room or link.

Referring to Figure 2, in an embodiment, step S101 above may further include, but is not limited to, steps S201 and S202.

Step S201: create a plurality of time buckets in a target cache area according to the total duration of the preset time period, where each time bucket is a timestamp interval, all time buckets have the same duration, and adjacent time buckets are contiguous in time.

Step S202: continuously acquire a plurality of abnormal events of the target location and cache each abnormal event in the time bucket corresponding to its acquisition time.

In an embodiment, abnormal events are cached by means of time buckets. Specifically, a plurality of time buckets are created in the target cache area according to the total duration of the preset time period. Each time bucket is a timestamp interval, all time buckets have the same duration, and adjacent time buckets are contiguous in time. As abnormal events of the target location are continuously acquired, each is cached in the time bucket corresponding to its acquisition time. The target cache area is the cache area corresponding to the target location; one target location may correspond to several cache areas or to exactly one. Aggregation uses the time dimension in both directions, and each cache area caches abnormal events in time buckets defined by timestamps and a fixed interval. As a result, aggregation does not start immediately after an abnormal event is cached; it waits for a certain time, until the time buckets have finished caching, before aggregation is prepared.

It should be noted that, in the target cache area, all time buckets have the same duration and adjacent time buckets are contiguous in time. For example, when a target cache area acquires abnormal events over a preset time period of 20 minutes, the duration of each time bucket can be set to 5 minutes, giving 4 contiguous time buckets: the first caches from minute 0 to minute 5, the second from minute 5 to minute 10, the third from minute 10 to minute 15, and the fourth from minute 15 to minute 20. Longer preset time periods follow the same pattern. The duration of each time bucket can be set according to actual operation and maintenance needs and is not specifically limited here.
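The 20-minute example can be sketched as follows. The function names are hypothetical, and two details are assumptions the source does not state: buckets are treated as half-open intervals `[start, end)`, so an event at exactly minute 5 falls into the second bucket, and the total duration is required to be an exact multiple of the bucket duration.

```python
def build_buckets(total_min, bucket_min):
    """Split [0, total_min) into contiguous, equal-length buckets."""
    if total_min % bucket_min != 0:
        raise ValueError("total duration must be a multiple of the bucket duration")
    return [(start, start + bucket_min) for start in range(0, total_min, bucket_min)]

def bucket_index(minute, bucket_min, n_buckets):
    """Index of the bucket covering an acquisition time (half-open intervals)."""
    idx = int(minute // bucket_min)
    if not 0 <= idx < n_buckets:
        raise ValueError("acquisition time lies outside the preset time period")
    return idx

def cache_events(events, total_min, bucket_min):
    """Place (minute, name) pairs into the bucket matching their acquisition time."""
    buckets = build_buckets(total_min, bucket_min)
    cached = [[] for _ in buckets]
    for minute, name in events:
        cached[bucket_index(minute, bucket_min, len(buckets))].append(name)
    return buckets, cached

buckets, cached = cache_events(
    [(1, "user login log"), (5, "configure route log"), (19.5, "service restart alarm")],
    total_min=20,
    bucket_min=5,
)
print(buckets)
print(cached)
```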

It can be understood that caching of abnormal events can be stopped by setting a condition under which the time buckets in the target cache area no longer cache events.

Referring to Figure 3, in an embodiment, step S202 above may further include, but is not limited to, steps S301 to S303.

Step S301: obtain a convergence condition for stopping the caching of abnormal events within the preset time period.

Step S302: starting from the start time of the preset time period, continuously acquire a plurality of abnormal events of the target location and cache each abnormal event in the time bucket corresponding to its acquisition time.

Step S303: when the cached abnormal events satisfy the convergence condition, stop caching abnormal events.

In an embodiment, to address the difficulty of determining the aggregation time range in the time dimension, abnormal events of the target location are cached in time buckets, and a convergence condition on the time buckets controls the point at which caching stops. Specifically, the convergence condition for stopping the caching of abnormal events within the preset time period is obtained. Starting from the start time of the preset time period, abnormal events of the target location are continuously acquired and cached, in order of acquisition time, in the time buckets corresponding to those times. When the cached abnormal events satisfy the convergence condition, caching stops. Satisfying the convergence condition means that the target cache area has finished caching: the cache area is closed and accepts no further abnormal events, and the time buckets are packaged in preparation for aggregation. The target cache area can also be cleared once the convergence condition is satisfied, ready to cache later abnormal events. Using a convergence condition to control when caching stops avoids wasting time on data collection, which improves aggregation efficiency and aggregation capability in the time dimension.
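Steps S301 to S303 can be sketched as a small cache that closes itself once a caller-supplied convergence predicate holds. The class and method names are hypothetical, and the predicate used in the usage example (stop after three events) is a placeholder chosen only to make the example short; it is not one of the patent's conditions.

```python
from collections import defaultdict

class BucketCache:
    """Caches events in time buckets until a convergence predicate is met."""

    def __init__(self, bucket_min, converged):
        self.bucket_min = bucket_min
        self.converged = converged        # callable: per-bucket counts -> bool (S301)
        self.buckets = defaultdict(list)
        self.closed = False

    def counts(self):
        """Number of cached events per bucket, from bucket 0 to the latest."""
        if not self.buckets:
            return []
        return [len(self.buckets[i]) for i in range(max(self.buckets) + 1)]

    def add(self, minute, event):
        """Cache an event (S302); return False if the cache is already closed."""
        if self.closed:
            return False
        self.buckets[int(minute // self.bucket_min)].append(event)
        if self.converged(self.counts()):  # S303: stop caching once converged
            self.closed = True
        return True

cache = BucketCache(bucket_min=5, converged=lambda counts: sum(counts) >= 3)
accepted = [cache.add(m, name) for m, name in [(1, "a"), (6, "b"), (7, "c"), (8, "d")]]
print(accepted, cache.counts())
```

Once `closed` is set, the cached buckets would be handed over for aggregation and the cache could be cleared for the next round, as the paragraph above describes.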

It should be noted that, from the perspective of the aggregation point, when the target cache area is closed on satisfaction of the convergence condition and caching stops, the data already cached contains abnormal events on both sides of the aggregation point in the time dimension. That is, the abnormal events before and after the aggregation point have all entered the target cache area. This improves the aggregation capability for the data and makes bidirectional aggregation possible.

Referring to Figure 4, in an embodiment, the convergence condition may include, but is not limited to, at least one of steps S401 to S403.

Step S401: the time at which abnormal events are acquired exceeds the end time of the preset time period.

Step S402: the rate at which the number of cached abnormal events decreases across several consecutive time buckets is below a preset target decrease rate.

Step S403: the number of abnormal events in a time bucket is below a preset minimum threshold for the number of events in a bucket.

In an embodiment, there may be several convergence conditions. They determine, in the time dimension, when data source collection is complete, which improves aggregation efficiency and aggregation capability in the time dimension. Specifically, the convergence condition may include at least one of steps S401 to S403. It can be understood that as soon as any one of these conditions is satisfied, data source collection is judged complete and caching of abnormal events stops.

It should be noted that whether the time at which abnormal events are acquired exceeds the end time of the preset time period is one of the convergence conditions. Specifically, the preset time period has a start time and an end time. When the acquisition time exceeds the end time, the caching time has expired. The interval from the time of the last aggregation point in the current cache to the cut-off time is the maximum value of the preset time period. It prevents the system from waiting too long, so this convergence condition forces caching to end. In one embodiment, the maximum aggregation time interval, that is, the preset time period, is set to 60 minutes; once this time has elapsed, subsequent messages are no longer waited for.

It should be noted that whether the rate of decrease in the number of cached abnormal events across multiple consecutive time buckets is less than the preset target decrease rate is one of the convergence conditions. Specifically, the judgment may be made over, for example, three consecutive time buckets whose event counts decrease at a certain rate. The condition is met when the count falls below the event-count decrease ratio, that is, below the target decrease rate, meaning that the count in a later bucket is lower than a certain percentage of the count in the preceding bucket. When this happens, it is judged that abnormal events no longer need to be cached. The target decrease rate may be, for example, 25%, and its value can be set according to actual needs. This convergence condition ends caching once the marginal benefit of further collection diminishes.

As shown in FIG. 5, the time buckets in the target cache area hold abnormal events such as a user login log, a route configuration log, a route restart log, a communication anomaly, a network-unreachable alarm, a key performance indicator anomaly (KPI anomaly in the figure), a service anomaly and a service restart alarm. Of these, the communication anomaly, the network-unreachable alarm, the key performance indicator anomaly and the service restart alarm are aggregation points. In the example of FIG. 5, the number of abnormal events in time bucket 4 is one third of that in time bucket 3, which is higher than the configured target decrease rate (25%), so caching cannot end yet and abnormal events continue to be received.

It should be noted that whether the number of abnormal events in a time bucket is less than the preset minimum threshold for the number of events in a bucket is one of the convergence conditions. Specifically, the minimum threshold, that is, the minimum bucket event count, can be set according to actual needs. When the number of abnormal events in a time bucket falls below this threshold, it is judged that abnormal events no longer need to be cached. In this way caching ends once the events have converged within a certain time. Referring again to FIG. 5, time bucket 4 receives only one event, which is less than the minimum threshold (assumed here to be 2). Abnormal events therefore no longer need to be received and caching stops, so time bucket 5 does not receive anything and collection of the data source is complete.
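The three convergence conditions can be sketched as a single check that returns true as soon as any one of them holds. This is only an illustrative sketch: the names `ConvergenceConfig` and `should_stop_caching` are hypothetical, and step S402 is interpreted here as "each of the last three bucket counts is below the target percentage of the count before it", which matches the FIG. 5 discussion but is not spelled out in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConvergenceConfig:
    # All field names and defaults are illustrative assumptions.
    max_wait_minutes: float = 60.0  # length of the preset time period (S401)
    target_ratio: float = 0.25      # target decrease rate (S402)
    trend_buckets: int = 3          # consecutive buckets examined for S402
    min_bucket_events: int = 2      # minimum events per bucket (S403)


def should_stop_caching(bucket_counts: List[int], elapsed_minutes: float,
                        cfg: ConvergenceConfig = ConvergenceConfig()) -> bool:
    """Return True if any one of the three convergence conditions holds."""
    # S401: the acquisition time has passed the end of the preset period.
    if elapsed_minutes > cfg.max_wait_minutes:
        return True
    if not bucket_counts:
        return False
    # S402 (assumed reading): across the last `trend_buckets` buckets, each
    # count is below `target_ratio` times the count of the bucket before it.
    recent = bucket_counts[-cfg.trend_buckets:]
    if len(recent) == cfg.trend_buckets and all(
        prev > 0 and cur / prev < cfg.target_ratio
        for prev, cur in zip(recent, recent[1:])
    ):
        return True
    # S403: the latest bucket holds fewer events than the minimum threshold.
    return bucket_counts[-1] < cfg.min_bucket_events
```

With hypothetical counts ending in 3 and 1, as in buckets 3 and 4 of FIG. 5, the ratio 1/3 exceeds 25% so S402 does not fire, while S403 fires because 1 is below the assumed minimum of 2.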

In one embodiment, there are multiple target locations. Referring to FIG. 6, the above step S201 may further include, but is not limited to, step S501 and step S502.

Step S501: obtain the preset time period corresponding to each target location.

Step S502: establish a target cache area for each target location, and establish multiple time buckets in each target cache area according to the total duration of the corresponding preset time period.

In one embodiment, when there are multiple target locations, the data source is cached separately for each target location. Specifically, each location can be given its own preset time period for caching. The embodiment of the present invention obtains the preset time period corresponding to each target location, caches the data of each target location, and establishes a target cache area for each target location. In one embodiment, target cache areas and target locations are in one-to-one correspondence, so each target location has exactly one target cache area. Multiple time buckets are established in each target cache area according to the total duration of the corresponding preset time period, so that the abnormal events of each target location are cached in the corresponding time buckets.
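A minimal sketch of one cache area per target location, each divided into time buckets derived from the total duration of its preset time period, might look as follows. The class `TargetBuffer`, the function `route_event`, the event keys `location` and `time`, and the bucket width parameter are all assumptions made for illustration; the text does not specify how the bucket count is derived from the duration.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Dict, List

Event = Dict[str, Any]


@dataclass
class TargetBuffer:
    start: float           # start of the preset time period, in seconds
    duration: float        # total length of the preset time period, in seconds
    bucket_seconds: float  # width of one time bucket (assumed parameter)
    buckets: List[List[Event]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The number of buckets follows from the total duration.
        count = max(1, math.ceil(self.duration / self.bucket_seconds))
        self.buckets = [[] for _ in range(count)]

    def add(self, event: Event) -> bool:
        """Cache the event in its time bucket; reject events outside the period."""
        offset = event["time"] - self.start
        if not 0 <= offset < self.duration:
            return False
        self.buckets[int(offset // self.bucket_seconds)].append(event)
        return True


def route_event(buffers: Dict[str, TargetBuffer], event: Event) -> bool:
    """Send an event to the cache area of its own target location, if any."""
    buffer = buffers.get(event["location"])
    return buffer.add(event) if buffer is not None else False
```

For example, a network element with a 60-minute period and 10-minute buckets gets six buckets, while a link with a 30-minute period gets three, and events are never mixed across locations.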

The target location in the embodiment of the present invention is a location in the spatial dimension. For example, a target location may be a network element, a link or an equipment room, and multiple target locations may include multiple network elements, links and equipment rooms. By collecting abnormal events for different target locations, an aggregation result can be obtained for each location so that each location can be analysed. It can be understood that the aggregation result of an individual target location can be used to analyse the fault root cause of that location, while an overall aggregation result across multiple target locations can be used to analyse the fault root cause among those locations. This improves the aggregation capability and the level of fault operation and maintenance.

Referring to FIG. 7, in one embodiment, the above step S102 may further include, but is not limited to, step S601 and step S602.

Step S601: obtain the filtering condition for aggregation points.

Step S602: among the multiple abnormal events, determine the abnormal events that satisfy the filtering condition to be aggregation points.

In one embodiment, the filtering condition for aggregation points is obtained and used to determine the aggregation points among the abnormal events. The filtering condition identifies which abnormal events are major alarms or major key performance indicator anomalies. A major alarm may be any of the alarms among the abnormal events, and a major key performance indicator anomaly may be any of the key performance indicator anomalies among the abnormal events. Examples of aggregation points are a base station going out of service or a cell going out of service. Aggregation points are the real focus of operation and maintenance, and aggregating around major alarms or major key performance indicator anomalies suits practical operation and maintenance needs. Aggregating around large numbers of ordinary alarms would instead waste a great deal of time and lower the level of operation and maintenance. The specific alarm types and key performance indicator anomaly types that count as aggregation points can be specified through configuration.
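As a rough illustration of steps S601 and S602, the filtering condition can be modelled as a configured set of major event types, and the aggregation points are the abnormal events whose type is in that set. The event keys `kind` and `type` and the type names below are hypothetical; only the base station and cell out-of-service examples come from the text.

```python
from typing import Any, Dict, List, Set, Tuple

Event = Dict[str, Any]

# Hypothetical configuration: (kind, type) pairs treated as major events.
MAJOR_TYPES: Set[Tuple[str, str]] = {
    ("alarm", "base_station_out_of_service"),
    ("alarm", "cell_out_of_service"),
}


def find_aggregation_points(events: List[Event],
                            major_types: Set[Tuple[str, str]]) -> List[Event]:
    """S602: keep the abnormal events that satisfy the filtering condition."""
    return [e for e in events if (e["kind"], e["type"]) in major_types]


events = [
    {"kind": "log", "type": "user_login", "time": 1},
    {"kind": "alarm", "type": "cell_out_of_service", "time": 2},
    {"kind": "alarm", "type": "fan_speed_warning", "time": 3},
]
points = find_aggregation_points(events, MAJOR_TYPES)
```

Operation logs and ordinary alarms stay in the cache as candidates for aggregation, but only the configured major events become centers of aggregation.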

It can be understood that the filtering condition can be customized according to actual operation and maintenance needs in order to identify the major alarms or major key performance indicator anomalies. The aggregation points are one or more of the many abnormal events. Because abnormal events are acquired continuously throughout the preset time period, an aggregation point lies within that period rather than at its start. Aggregation centered on an aggregation point can therefore include multiple abnormal events both before and after the time at which the aggregation point was acquired, and these events may be at least one of alarms, key performance indicator anomalies and operation logs. Consequently, when some operation performed before the aggregation point caused the alarm or key performance indicator anomaly, the data both before and after the aggregation point can be aggregated. The resulting aggregation result can be used to identify the fault root cause. This bidirectional aggregation improves the aggregation capability of the data source and the level of fault operation and maintenance.

It should be noted that the related art does not take an operation log as the starting point and aggregate forward in time from it, because logs are too numerous and too frequent, and most operation logs are merely records rather than indications of an anomaly. Since an operation log does not readily reveal an anomaly, the anomaly cannot be perceived at the time the log is written. Typically the relevant operation logs are searched for only after an alarm or key performance indicator anomaly has occurred. With faults such as memory leaks, for example, the leak or a leak trend is discovered first, and only then are the earlier operation logs examined. This results in poor aggregation capability.

Referring to FIG. 8, in one embodiment, the above step S103 may further include, but is not limited to, steps S701 to S703.

Step S701: determine first target events and second target events among the abnormal events, where a first target event is a noise event of the aggregation point and a second target event is an associated event of the aggregation point.

Step S702: remove the first target events and retain the second target events.

Step S703: perform aggregation according to the aggregation point and the second target events to obtain the aggregation result.

In one embodiment, the abnormal events can be denoised: unnecessary events are removed and the abnormal events related to the aggregation point are retained, which improves the aggregation capability. Specifically, first target events and second target events are determined among the abnormal events. A first target event is a noise event of the aggregation point, and a second target event is an associated event of the aggregation point. If noise events were aggregated with the aggregation point, the final aggregation result would contain too much data, including many abnormal events that are useless for fault root cause analysis. The embodiment of the present invention therefore determines the noise events of the aggregation point (the first target events) and the associated events of the aggregation point (the second target events), removes the first target events and retains the second target events. Aggregation is then performed according to the aggregation point and the second target events to obtain the aggregation result, which improves the aggregation capability and the level of fault operation and maintenance.

It should be noted that the abnormal events in the embodiment of the present invention include operation logs, and a large number of operation logs exist in actual operation and maintenance. Through bidirectional aggregation, the abnormal events before and after the aggregation point, including the operation logs before and after it, are collected into the aggregation result. The aggregation result can then be used for fault root cause analysis, for example to find the operation log that caused the anomaly at the aggregation point. To address the problem that operation logs are too numerous and largely unrelated to the aggregation point, the embodiment of the present invention identifies the first target events and second target events among the abnormal events, removes the first target events and retains the second target events, thereby ensuring aggregation capability and efficiency.

Take the abnormal events collected in FIG. 5 as an example. When the communication anomaly is used as the aggregation point, bidirectional aggregation around it can be performed as shown in FIG. 9: toward earlier times it aggregates abnormal events such as the user login log, the route configuration log and the route restart log, and toward later times it aggregates abnormal events such as the network-unreachable alarm, the key performance indicator anomaly and the service anomaly. When the network-unreachable alarm is used as the aggregation point, bidirectional aggregation can be performed as shown in FIG. 10: toward earlier times it aggregates abnormal events such as the user login log, the route configuration log, the route restart log and the communication anomaly, and toward later times it aggregates abnormal events such as the key performance indicator anomaly, the service anomaly and the service restart alarm.
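The FIG. 9 and FIG. 10 examples can be sketched as splitting the cached events around the time of the chosen aggregation point. The function `aggregate_around` and the integer timestamps are assumptions for illustration; the event names follow the order described for FIG. 5, and this simple split includes every later event rather than a selected subset.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def aggregate_around(point: Event, events: List[Event]) -> Dict[str, Any]:
    """Collect the events earlier and later than the aggregation point."""
    earlier = [e for e in events if e["time"] < point["time"]]
    later = [e for e in events if e["time"] > point["time"]]
    return {"point": point, "earlier": earlier, "later": later}


# Events in the order described for FIG. 5; timestamps are hypothetical.
names = [
    "user_login_log", "route_config_log", "route_restart_log",
    "communication_anomaly", "network_unreachable_alarm",
    "kpi_anomaly", "service_anomaly", "service_restart_alarm",
]
events = [{"name": n, "time": t} for t, n in enumerate(names)]

by_name = {e["name"]: e for e in events}
fig10 = aggregate_around(by_name["network_unreachable_alarm"], events)
```

Because the whole window is already cached when aggregation starts, looking at earlier events costs no more than looking at later ones, which is what makes the aggregation bidirectional.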

Referring to FIG. 11, in one embodiment, the above step S103 may further include, but is not limited to, step S801 and step S802.

Step S801: aggregate the aggregation points and the abnormal events to obtain an aggregation package.

Step S802: perform root cause identification on the aggregation package, and combine the abnormal events corresponding to each aggregation point to obtain the root cause identification result of the aggregation point.

In one embodiment, root cause identification can be performed to obtain a root cause identification result. The aggregation points and the abnormal events are aggregated into an aggregation package, root cause identification is performed on the package, and the abnormal events corresponding to each aggregation point are combined to obtain the root cause identification result of that aggregation point. In another embodiment, aggregation is performed according to the aggregation point and the second target events to obtain the aggregation package. Aggregating only the second target events yields a more efficient aggregation package. Root cause identification is then performed on these useful abnormal events, and techniques such as a knowledge base can be applied to the second target events of the aggregation point to analyse which abnormal event is the root cause event, thereby improving the level of fault operation and maintenance.

Referring to FIG. 12, in one embodiment, the above step S701 may further include, but is not limited to, step S901 and step S902.

Step S901: initialize the abnormal events to obtain initial data, and input the initial data into a preset bidirectional aggregation model for probability calculation, obtaining the noise probability value of each abnormal event with respect to the corresponding aggregation point.

Step S902: determine the first target events and second target events among the abnormal events according to the noise probability values.

In one embodiment, a preset bidirectional aggregation model is used to determine the first target events and second target events among the abnormal events. The bidirectional aggregation model is a data processing model obtained by training a neural network model. Specifically, the abnormal events are initialized to obtain initial data, and the initial data is input into the preset bidirectional aggregation model for probability calculation, yielding the noise probability value of each abnormal event with respect to the corresponding aggregation point. The input of the model must be in the matching initial data form so that the model can process it. The noise probability value characterizes how likely the abnormal event is to be a noise event of the corresponding aggregation point. From this probability it can be decided whether the abnormal event is a noise event of that aggregation point, and the first target events and second target events are determined accordingly.

It can be understood that there may be multiple aggregation points. In that case, the probability of each abnormal event is calculated by the bidirectional aggregation model against every aggregation point, yielding a noise probability value for each aggregation point. This is because some abnormal events have a low probability for certain aggregation points but a high probability for others. Calculating the probability of each abnormal event against every aggregation point avoids discarding abnormal events that have a high probability for some aggregation point, and helps aggregate all the aggregation points.

Referring to FIG. 13, in one embodiment, the above step S902 may further include, but is not limited to, steps S1001 to S1003.

Step S1001: obtain the first probability threshold and the second probability threshold of each aggregation point.

Step S1002: determine the abnormal events whose noise probability values are lower than all of the first probability thresholds to be first target events.

Step S1003: determine the abnormal events whose noise probability value is higher than any one of the second probability thresholds to be second target events.

In one embodiment, abnormal events are filtered by means of a low probability threshold and a high probability threshold. The first probability threshold and the second probability threshold of each aggregation point are obtained. The first probability threshold is the low probability threshold and is used to pick out the first target events: abnormal events whose noise probability values are lower than all of the first probability thresholds are determined to be first target events, which are low-probability events. The second probability threshold is the high probability threshold and is used to pick out the second target events: abnormal events whose noise probability value is higher than any one of the second probability thresholds are determined to be second target events, which are high-probability events.

It should be noted that abnormal events below the low probability threshold are marked. The first probability threshold, which serves as the low probability threshold, can be set in the user interface or in a configuration file according to actual operation and maintenance needs. Abnormal events above the high probability threshold are placed in a high-probability list. The second probability threshold, which serves as the high probability threshold, can likewise be set in the user interface or in a configuration file according to actual operation and maintenance needs. The high-probability list is stored as a key-value pair whose key is the abnormal point and whose value is a list holding the abnormal events above the high probability threshold. Abnormal events below the low probability threshold are not excluded immediately while each aggregation point is being analysed, because some abnormal events have a low probability for certain aggregation points but a high probability for others. Therefore, an abnormal event is determined to be a first target event only if its noise probability value is lower than the first probability threshold of every aggregation point, whereas it is determined to be a second target event as soon as its noise probability value is higher than the second probability threshold of any one aggregation point.

In one embodiment, the first probability threshold works as follows: when an aggregation point is used to predict its contextually associated events, an abnormal event whose probability is very low, that is, lower than the first probability threshold (set to 10%, for example) for every aggregation point, can be treated as noise and removed. The second probability threshold works as follows: when an aggregation point is used to predict its contextually associated events, an abnormal event whose probability is higher than the second probability threshold (set to 75%, for example) can be considered strongly correlated and can assist the subsequent root cause analysis.
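The "all thresholds" rule for noise and the "any threshold" rule for associated events can be sketched as below. The function `classify_events`, the nested dictionary layout and the sample probabilities are assumptions; the high-probability list is keyed here by aggregation point, on the assumption that the "abnormal point" used as the key refers to the aggregation point. Events that fall between the two thresholds are left unclassified, since the text does not say how they are handled.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def classify_events(
    probs: Dict[str, Dict[str, float]],  # probs[event][aggregation_point]
    low: Dict[str, float],               # first (low) threshold per point
    high: Dict[str, float],              # second (high) threshold per point
) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Return (first target events, high-probability list per point)."""
    noise: Set[str] = set()
    related: Dict[str, List[str]] = defaultdict(list)
    for event, per_point in probs.items():
        # First target event: below the low threshold of EVERY point.
        if per_point and all(p < low[pt] for pt, p in per_point.items()):
            noise.add(event)
        # Second target event: above the high threshold of ANY point.
        for pt, p in per_point.items():
            if p > high[pt]:
                related[pt].append(event)
    return noise, dict(related)


# Hypothetical probabilities against two aggregation points A and B.
probs = {
    "user_login_log": {"A": 0.05, "B": 0.08},
    "route_restart_log": {"A": 0.05, "B": 0.90},
    "kpi_anomaly": {"A": 0.50, "B": 0.30},
}
low = {"A": 0.10, "B": 0.10}
high = {"A": 0.75, "B": 0.75}
noise, related = classify_events(probs, low, high)
```

Here `route_restart_log` is low for point A but is kept because it is high for point B, which is the situation the "all points" rule for noise is meant to protect against.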

Referring to FIG. 14, in one embodiment, the above step S901 may further include, but is not limited to, steps S1101 to S1103.

Step S1101: one-hot encode the abnormal events to obtain the initialized initial vector data.

Step S1102: obtain the preset bidirectional aggregation model, where the bidirectional aggregation model is obtained by unsupervised training on sample abnormal events, sample target events characterized as noise events, and sample aggregation points taken from the samples.

Step S1103: input the initial vector data into the preset bidirectional aggregation model for probability calculation, obtaining the noise probability value of each abnormal event with respect to the corresponding aggregation point.

In one embodiment, the abnormal events must first be converted into initialized vectors before being input into the preset bidirectional aggregation model, which then produces the required noise probability values. Specifically, the bidirectional aggregation model can be built in advance from the data in the samples. During aggregation there are a large number of abnormal events, and not all of them are useful for root cause analysis. Some are noise events as far as an aggregation point is concerned, such as operation logs of routine operations, or flapping alarms that merely happen to fall in the same time window as the abnormal point. Their presence interferes with aggregation analysis. Filtering them by probability, using a model trained with artificial intelligence (AI), makes the aggregation analysis more accurate.

In natural language processing (NLP), the continuous skip-gram model (Skip-gram) of word vectorization (word embedding, Word2vec) uses a center word to predict the probability of its context words. The embodiment of the present invention applies the same principle: after the abnormal events are vectorized, the bidirectional aggregation model gives the probability of each abnormal event in the periods before and after an aggregation point, and the abnormal events are denoised using the probability thresholds.

A trainer can be provided to load historical data, which may include the sample abnormal events, the sample target events characterized as noise events, and the sample aggregation points in the samples. In the training phase, these data are one-hot encoded and then used for unsupervised training. When the context probability is maximized and the loss function is minimized, the abnormal events have been vectorized and the bidirectional aggregation model has been obtained for use in subsequent applications.

To perform the probability calculation, the trained bidirectional aggregation model is loaded first. The abnormal events are one-hot encoded to obtain the initialized initial vector data, which is input into the preset bidirectional aggregation model for probability calculation, yielding the noise probability value of each abnormal event with respect to the corresponding aggregation point. Because the input has already been given an initial vector representation by one-hot encoding before it enters the model, the trained and released bidirectional aggregation model can be applied to the events directly to perform the vector probability calculation.
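The inference step can be illustrated with a toy Skip-gram style forward pass: a one-hot vector for the aggregation point selects a row of an input weight matrix, and a softmax over the output layer gives a probability for every event type. This is a standard Skip-gram forward pass used as a stand-in, not the patent's actual model; the function names and the tiny hand-written weights are hypothetical, and real weights would come from training.

```python
import math
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Encode an event type as a vector with a single 1 at its index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_probabilities(center: int, w_in: List[List[float]],
                          w_out: List[List[float]]) -> List[float]:
    """Probability of each event type appearing around the center event."""
    size, dim = len(w_in), len(w_in[0])
    x = one_hot(center, size)
    # Hidden layer: x times w_in, which picks out row `center` of w_in.
    hidden = [sum(x[i] * w_in[i][k] for i in range(size)) for k in range(dim)]
    # Output layer: hidden times w_out, one score per event type.
    scores = [sum(hidden[k] * w_out[k][j] for k in range(dim))
              for j in range(len(w_out[0]))]
    return softmax(scores)


# Toy vocabulary of three event types with hand-written (untrained) weights.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
probs = context_probabilities(0, w_in, w_out)
```

The resulting probabilities are what the low and high thresholds would be compared against for the chosen aggregation point.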

It can be understood that the embodiment of the present invention may use, but is not limited to, the Skip-gram model of Word2vec for training, including construction of the neural network, to obtain the required bidirectional aggregation model. First, historical data in the samples is obtained, or material such as a fault handling manual is used as the corpus, and it is one-hot encoded. The Skip-gram model is then trained with its loss function minimized over all the probabilities. At that point, the correlations between different abnormal events such as alarms and operation logs are captured by the trained intermediate hidden layer, which is the model ultimately required. The steps of training the bidirectional aggregation model are not described in detail in the embodiment of the present invention.

Referring to FIG. 15, in an embodiment, the above step S901 may further include, but is not limited to, step S1201 and step S1202.

Step S1201: sort the multiple aggregation points by time and store them in an aggregation point list.

Step S1202: input the initial data into the preset bidirectional aggregation model, and perform probability calculation on the abnormal events for each aggregation point in the aggregation point list, to obtain the noise probability value of each abnormal event with respect to the corresponding aggregation point.
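Steps S1201 and S1202 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the `Event` type, the function names, and in particular `toy_model` are hypothetical. `toy_model` merely stands in for the trained bidirectional aggregation model and scores events by closeness in time so that the example is runnable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    timestamp: float  # seconds
    is_aggregation_point: bool = False


def build_aggregation_point_list(events):
    """Step S1201: collect the aggregation points and sort them by time."""
    points = (e for e in events if e.is_aggregation_point)
    return sorted(points, key=lambda e: e.timestamp)


def score_events(events, aggregation_points, model):
    """Step S1202: for each aggregation point, score every other event.

    `model(point, event)` is a placeholder for the trained bidirectional
    aggregation model and is assumed to return a probability in [0, 1].
    """
    scores = {}
    for point in aggregation_points:
        scores[point.event_id] = {
            e.event_id: model(point, e)
            for e in events
            if e.event_id != point.event_id
        }
    return scores


def toy_model(point, event):
    """Hypothetical stand-in: events closer in time score higher."""
    return 1.0 / (1.0 + abs(point.timestamp - event.timestamp) / 60.0)


events = [
    Event("e1", 100.0),
    Event("p2", 400.0, is_aggregation_point=True),
    Event("p1", 160.0, is_aggregation_point=True),
    Event("e2", 460.0),
]
points = build_aggregation_point_list(events)
scores = score_events(events, points, toy_model)
```

Because events on both sides of each aggregation point are scored, an event that occurred before the aggregation point is treated the same way as one that occurred after it.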

In an embodiment, an aggregation point list is established to store the aggregation points. After the target cache area stops caching abnormal events, it is closed and no longer receives other abnormal events; the time buckets are encapsulated into an initial package in preparation for aggregation, and the cache area is then cleared to await later events. It should be emphasized that, from the perspective of the aggregation point, the difference from the conventional practice of one-way backward aggregation is that, when the target cache area is closed, it already contains events in both directions; that is, the events both before and after the aggregation point at the center have entered the cache area. Bidirectional aggregation in the embodiment of the present invention is performed relative to the aggregation point, so the aggregation points of the target cache area are collected first. If there is no aggregation point, the target cache area is directly recycled for the next round of caching. Because a target cache area, or even a single bucket within it, may contain multiple aggregation points, the aggregation points are collected, sorted by time, and stored in the aggregation point list. The abnormal event with the earliest occurrence time and the abnormal event with the latest occurrence time in the cache area are then given. In addition, the location of the target cache area, such as a network element, an equipment room, or a link, may also be given.

It should be noted that, when performing probability calculation, the embodiment of the present invention first obtains the aggregation point list and then, for each aggregation point in the list, uses the bidirectional model to obtain the noise probability values of the other abnormal events in the data. Abnormal events below the low probability threshold are marked, and abnormal events above the high probability threshold are placed in a high probability list, whose key is the abnormal point and whose value is the list of abnormal events above the high probability threshold. Finally, after denoising, each aggregation point is combined with its high probability list into an aggregation package, which is sent to root cause identification.

In addition, the abnormal event processing method in the embodiment of the present invention can be applied in an abnormal event processing device (hereinafter, the processing device). The processing device may include:

Cache: receives external abnormal events into the cache area, assembles them into an initial package, and sends it to the packer;

Packer: receives the initial package, packs it into an encoded package, and sends it to the aggregator;

Aggregator: receives the encoded package, performs context probability training and prediction centered on the aggregation point, denoises, obtains the aggregation package, and sends it to root cause analysis;

The trainer, the cache, the packer, and the aggregator are communicatively connected. When the processing device executes the abnormal event processing method of the above embodiments, the following four steps may be included:

Step 1: The trainer trains the bidirectional aggregation model to complete event vectorization.

Specifically, aggregation may involve a large number of abnormal events, not all of which are useful for root cause analysis. Some are noise events with respect to the aggregation point, such as routine operation logs or transient disconnection alarms that merely happen to fall in the same time window as the abnormal point. Their presence interferes with aggregation analysis, so filtering them by probability through AI training makes the aggregation analysis more accurate.

Just as the Skip-gram model of Word2vec in NLP uses the center word to predict the probability of context words, the embodiment of the present invention applies the same principle: after the abnormal events are vectorized, the bidirectional aggregation model gives the probability of the aggregation point with respect to the abnormal events in the preceding and following time periods, and the abnormal events are denoised using probability thresholds.

The trainer loads historical alarms, logs, key performance indicator anomalies, fault handling manuals, and the like as the corpus and performs one-hot encoding. Through unsupervised training, when the context probability is maximized and the loss function is minimized, the abnormal events are vectorized and the bidirectional aggregation model is obtained. The model is then released.
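For reference, the standard Skip-gram objective with a full softmax can be written for events instead of words as below. The embodiment only states that Skip-gram may be used and does not give its formulas, so the notation here is an assumption: e_t is the t-th event in a training sequence of length T, m is the context window size, v_e and u_e are the center and context vectors of event e, and V is the set of distinct events.

```latex
% Probability of a context event given the center event (softmax over all events)
\[
  p(e_{t+j} \mid e_t) =
  \frac{\exp\left(u_{e_{t+j}}^{\top} v_{e_t}\right)}
       {\sum_{e \in V} \exp\left(u_{e}^{\top} v_{e_t}\right)}
\]

% Loss: average negative log-likelihood over the window on both sides of e_t
\[
  L = -\frac{1}{T} \sum_{t=1}^{T} \;
      \sum_{-m \le j \le m,\; j \ne 0} \log p(e_{t+j} \mid e_t)
\]
```

Minimizing L is equivalent to maximizing the context probabilities, and the sum over both negative and positive j is what makes the context window extend before and after the center event.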

Step 2: The cache receives and caches streaming abnormal events.

Abnormal events arrive as a stream, so the abnormal events of a certain time period need to be cached.

The cache sets up a separate cache area for each spatial dimension, that is, for each location. A cache area can cache only abnormal events of the same spatial dimension. Each cache area caches abnormal events in time buckets defined by timestamps and a fixed time interval. If an event is an aggregation point, it is marked.

Once an aggregation point exists, aggregation does not start immediately; it is necessary to wait for a certain time until the time buckets have finished caching before preparing to aggregate. A time bucket is a time interval, for example five minutes, and caches the abnormal events of those five minutes; the next time bucket caches the abnormal events of the next time interval, for example the next five minutes.

Referring to FIG. 5, after streaming abnormal events enter, each location has one cache area. In the figure, each five-minute time bucket caches one batch of abnormal events, and different time buckets may differ in size.
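The time-bucket caching described above can be sketched as follows. This is an illustration under stated assumptions: timestamps are taken to be seconds, the bucket length follows the five-minute example, and the class `CacheArea`, the function `bucket_index`, and the location name `NE-1` are hypothetical.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # five-minute buckets, as in the example; configurable


def bucket_index(timestamp, start_time, bucket_seconds=BUCKET_SECONDS):
    """Return the index of the time bucket that a timestamp falls into."""
    return int((timestamp - start_time) // bucket_seconds)


class CacheArea:
    """One cache area per location; events are grouped into time buckets."""

    def __init__(self, location, start_time, bucket_seconds=BUCKET_SECONDS):
        self.location = location
        self.start_time = start_time
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(list)  # bucket index -> cached events

    def add(self, event_id, timestamp, is_aggregation_point=False):
        """Cache an event in its bucket and keep the aggregation point mark."""
        idx = bucket_index(timestamp, self.start_time, self.bucket_seconds)
        self.buckets[idx].append((timestamp, event_id, is_aggregation_point))
        return idx


area = CacheArea(location="NE-1", start_time=0.0)
area.add("e1", 10.0)
area.add("p1", 299.0, is_aggregation_point=True)
area.add("e2", 300.0)
```

Keeping one `CacheArea` per location mirrors the rule that a cache area holds events of a single spatial dimension only.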

A cache area consists of one or more time buckets. The key question is when to cut off, that is, when the current round of caching is complete and aggregation can begin. The embodiment of the present invention uses three dimensions as convergence conditions for completing the caching of the last time bucket:

The caching time expires, that is, the time interval from the time of the last aggregation point in the current cache to the deadline has elapsed. This interval is the maximum time interval and prevents excessively long waits; it is a forced end;

The number of abnormal events in three consecutive time buckets decreases at a certain rate and falls below the event count decrement ratio, that is, the count of a later bucket is below a certain percentage, such as 25%, of the count of the preceding bucket. This figure is configurable. This ends caching once marginal returns have diminished. In the illustration of FIG. 5, the number of events in time bucket 4 is one third of that in time bucket 3, so caching cannot end yet and abnormal events continue to be received;

The number of abnormal events in a time bucket is less than the minimum threshold for the number of events in a bucket, that is, the minimum bucket event count. This value is configurable. This ends caching once events have converged within a certain time. Again using the illustration of FIG. 5, time bucket 4 receives only 1 event, which is less than the minimum threshold (assumed to be 2), so no further reception is needed; that is, time bucket 5 does not need to receive abnormal data.

When any one of the above three conditions is met, caching for this cache area is complete: the cache area is closed and no longer receives other abnormal events, the time buckets are encapsulated into an initial package and sent to the packer in preparation for aggregation, and the cache area is then cleared to await later events.
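The three convergence conditions can be sketched as a single check. This is one possible reading rather than the embodiment's implementation: the function name, the 1800-second maximum wait, and the interpretation of the second condition (each of the two most recent buckets holds less than the configured fraction of its predecessor) are assumptions. The 25% ratio and the minimum of 2 events come from the examples in the text.

```python
def should_close(bucket_counts, now, last_aggregation_time,
                 max_wait_seconds=1800, decrement_ratio=0.25,
                 min_bucket_events=2):
    """Return the name of the first satisfied condition, or None.

    bucket_counts: event counts of the completed time buckets, oldest first.
    """
    # Condition 1: forced end once the maximum wait after the last
    # aggregation point has elapsed (max_wait_seconds is a made-up default).
    if (last_aggregation_time is not None
            and now - last_aggregation_time >= max_wait_seconds):
        return "deadline"

    # Condition 2 (assumed reading): over three consecutive buckets, each
    # later bucket holds less than decrement_ratio of the one before it.
    if len(bucket_counts) >= 3:
        first, second, third = bucket_counts[-3:]
        if second < decrement_ratio * first and third < decrement_ratio * second:
            return "diminishing"

    # Condition 3: the latest bucket holds fewer events than the minimum.
    if bucket_counts and bucket_counts[-1] < min_bucket_events:
        return "converged"

    return None


# Counts chosen so that bucket 4 holds one third of bucket 3, as in the text.
after_bucket_3 = should_close([12, 9, 3], now=900, last_aggregation_time=600)
after_bucket_4 = should_close([12, 9, 3, 1], now=1200, last_aggregation_time=600)
```

With these counts, one third is above the 25% ratio, so the second condition does not end caching, while the single event in bucket 4 falls below the minimum of 2 and triggers the third condition.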

It should be emphasized that, from the perspective of the aggregation point, the difference from the conventional practice of one-way backward aggregation is that, when the cache area is closed in the embodiment of the present invention, it already contains events in both directions; that is, the events both before and after the aggregation point at the center have entered the cache area.

Step 3: The packer packs the initial package.

After caching is complete, the packer packs the initial package into an encoded package. Bidirectional aggregation in the present invention is performed relative to the aggregation point, so the packer first collects the aggregation points of this cache area. If there is no aggregation point, the cache area is directly recycled for the next round of caching. Because a cache area, or even a single bucket within it, may contain multiple aggregation points, the packer collects the aggregation points and sorts them by time. It then gives the abnormal event with the earliest occurrence time and the abnormal event with the latest occurrence time in the cache area, gives the location of the cache area, such as a network element, an equipment room, or a link, and performs one-hot encoding on the abnormal events of the cache area. Once these operations are complete, packing is finished and the encoded package is obtained, which the packer sends to the aggregator for aggregation.
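The packing step can be sketched as follows. The `EncodedPackage` structure, the tuple layout of the events, and the sample event names are hypothetical. For brevity the one-hot vocabulary is built from the package itself, which is a simplification; a real system would presumably reuse the vocabulary from training.

```python
from dataclasses import dataclass


@dataclass
class EncodedPackage:
    location: str             # e.g. network element, equipment room, or link
    aggregation_points: list  # event ids of aggregation points, sorted by time
    earliest_event: str       # id of the earliest event in the cache area
    latest_event: str         # id of the latest event in the cache area
    one_hot: dict             # event id -> one-hot vector of its event type


def pack(location, events):
    """Pack an initial package into an encoded package.

    events: list of (timestamp, event_id, event_type, is_aggregation_point).
    Returns None when there is no aggregation point, in which case the
    cache area would simply be recycled.
    """
    points = sorted((e for e in events if e[3]), key=lambda e: e[0])
    if not points:
        return None

    ordered = sorted(events, key=lambda e: e[0])
    # Simplification: vocabulary derived from this package only.
    vocabulary = {t: i for i, t in enumerate(sorted({e[2] for e in events}))}
    one_hot = {}
    for _, event_id, event_type, _ in events:
        vector = [0] * len(vocabulary)
        vector[vocabulary[event_type]] = 1
        one_hot[event_id] = vector

    return EncodedPackage(
        location=location,
        aggregation_points=[e[1] for e in points],
        earliest_event=ordered[0][1],
        latest_event=ordered[-1][1],
        one_hot=one_hot,
    )


initial_package = [
    (120.0, "a2", "board_fault_alarm", True),
    (30.0, "l1", "config_change_log", False),
    (60.0, "a1", "link_down_alarm", True),
    (200.0, "k1", "kpi_latency_anomaly", False),
]
package = pack("NE-1", initial_package)
```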

Step 4: Aggregation.

The aggregator loads the trained bidirectional aggregation model and completes aggregation by denoising both forward and backward around the aggregation points in the package.

When step 3 is complete, the aggregator receives the encoded package sent by the packer. Because the events have already been one-hot encoded, the bidirectional model obtained through training and release can be used directly to vectorize the events and perform vector probability calculation on the abnormal events in the package.

The aggregator first obtains the aggregation point list and then, for each aggregation point in the list, uses the bidirectional model to obtain the probabilities of the other abnormal events in the package. Abnormal events below the low probability threshold (which can be set in the interface or in a configuration file) are marked, and abnormal events above the high probability threshold (which may themselves be aggregation points) are placed in a high probability list, whose key is the abnormal point and whose value is the list of abnormal events above the high probability threshold.

Note that abnormal events below the low probability threshold are not excluded immediately while each aggregation point is being analyzed, because some abnormal events have a low probability for certain aggregation points but a high probability for others. After all aggregation points in the encoded package have been analyzed, all abnormal events marked as low probability are reviewed, and an event is cleared only if its probability is below the low probability threshold for every aggregation point.

After analyzing all aggregation points, the aggregator may perform a second check on the marked abnormal events below the low probability threshold to see whether they are low probability for every aggregation point. If not, they are retained; otherwise they are removed as noise. After denoising, the aggregator attaches the high probability list to each aggregation point in the encoded package, combines them into an aggregation package, and sends it to root cause identification. The current aggregation is then complete.
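The marking and second-check logic of the aggregator can be sketched as follows. The function name, the threshold values 0.2 and 0.7, and the sample probabilities are made up for illustration. Skipping aggregation points during removal is also an assumption, since the text does not say how an aggregation point that scores low against another one is handled.

```python
def denoise_and_aggregate(scores, low_threshold=0.2, high_threshold=0.7):
    """Build per-aggregation-point high probability lists and find noise.

    scores: {aggregation_point_id: {event_id: probability}}, where each
    probability would come from the bidirectional aggregation model.
    Returns (high_lists, removed): the high probability list of each
    aggregation point, and the ids of events removed as noise.
    """
    high_lists = {}
    marked_low = set()
    for point, per_event in scores.items():
        # Events above the high threshold are attached to this point.
        high_lists[point] = [e for e, p in per_event.items() if p > high_threshold]
        # Events below the low threshold are only marked, not yet removed.
        marked_low.update(e for e, p in per_event.items() if p < low_threshold)

    # Second check: remove a marked event only if it is low for every point.
    removed = set()
    for event_id in marked_low:
        if event_id in scores:
            continue  # assumption: aggregation points are never removed
        probabilities = [
            per_event[event_id]
            for per_event in scores.values()
            if event_id in per_event
        ]
        if all(p < low_threshold for p in probabilities):
            removed.add(event_id)
    return high_lists, removed


# Made-up probabilities for two aggregation points and three other events.
scores = {
    "p1": {"e1": 0.9, "e2": 0.1, "e3": 0.05, "p2": 0.8},
    "p2": {"e1": 0.1, "e2": 0.75, "e3": 0.1, "p1": 0.8},
}
high_lists, removed = denoise_and_aggregate(scores)
```

In this example `e1` and `e2` are each low for one aggregation point but high for the other, so they are retained; only `e3`, which is low for both, is removed.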

FIG. 16 shows an electronic device 100 provided by an embodiment of the present invention. The electronic device 100 includes a processor 110, a memory 120, and a computer program stored in the memory 120 and executable on the processor 110; when run, the computer program executes the abnormal event processing method described above.

The processor 110 and the memory 120 may be connected by a bus or by other means.

As a non-transitory computer-readable storage medium, the memory 120 can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the abnormal event processing method described in the embodiments of the present invention. The processor 110 implements the abnormal event processing method described above by running the non-transitory software programs and instructions stored in the memory 120.

The memory 120 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data for executing the abnormal event processing method described above. In addition, the memory 120 may include high-speed random access memory and may also include non-transitory memory, such as at least one storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 120 optionally includes memory located remotely from the processor 110, and such remote memory can be connected to the electronic device 100 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software programs and instructions required to implement the abnormal event processing method described above are stored in the memory 120 and, when executed by one or more processors 110, perform that method, for example, method steps S101 to S103 in FIG. 1, method steps S201 to S202 in FIG. 2, method steps S301 to S303 in FIG. 3, method steps S401 to S403 in FIG. 4, method steps S501 to S502 in FIG. 6, method steps S601 to S602 in FIG. 7, method steps S701 to S703 in FIG. 8, method steps S801 to S802 in FIG. 11, method steps S901 to S902 in FIG. 12, method steps S1001 to S1003 in FIG. 13, method steps S1101 to S1103 in FIG. 14, and method steps S1201 to S1202 in FIG. 15.

An embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the abnormal event processing method described above.

In an embodiment, the computer-readable storage medium stores computer-executable instructions that are executed by one or more control processors to perform, for example, method steps S101 to S103 in FIG. 1, method steps S201 to S202 in FIG. 2, method steps S301 to S303 in FIG. 3, method steps S401 to S403 in FIG. 4, method steps S501 to S502 in FIG. 6, method steps S601 to S602 in FIG. 7, method steps S701 to S703 in FIG. 8, method steps S801 to S802 in FIG. 11, method steps S901 to S902 in FIG. 12, method steps S1001 to S1003 in FIG. 13, method steps S1101 to S1103 in FIG. 14, and method steps S1201 to S1202 in FIG. 15.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, storage devices or other magnetic storage apparatus, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

It should also be understood that the various implementations provided by the embodiments of the present invention can be combined arbitrarily to achieve different technical effects.

The preferred implementations of the present invention have been described in detail above, but the present invention is not limited to these embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the present invention.

Claims

1. An abnormal event processing method, the method comprising:

acquiring a plurality of abnormal events of a target position in a preset time period, wherein the abnormal events comprise at least one of alarms, key performance index anomalies and operation logs;

determining an aggregation point in the abnormal event;

and aggregating according to the aggregation point and the abnormal event to obtain an aggregation result.

2. The abnormal event processing method according to claim 1, wherein the acquiring a plurality of abnormal events of the target position within a preset period of time includes:

establishing a plurality of time buckets in a target cache area according to the total duration of a preset time period, wherein the time buckets are formed by timestamp intervals, the duration of each time bucket is the same, and the times of two adjacent time buckets are continuous;

and continuously acquiring a plurality of abnormal events of the target position, and caching the abnormal events in the time bucket of the corresponding time according to the acquisition time of each abnormal event.

3. The method according to claim 2, wherein the continuously acquiring a plurality of abnormal events of the target position and caching the abnormal events in the time bucket of the corresponding time according to the acquisition time of each abnormal event comprises:

acquiring a convergence condition for stopping caching the abnormal event within the preset time period;

continuously acquiring a plurality of abnormal events of a target position from the starting time of the preset time period, and caching the abnormal events in the time bucket of the corresponding time according to the acquisition time of each abnormal event;

and stopping caching the abnormal event when the cached abnormal event meets the convergence condition.

4. The method of claim 3, wherein the convergence condition comprises at least one of:

the time of acquiring the abnormal event exceeds the ending time of the preset time period;

the rate of decrease in the number of abnormal events cached across consecutive time buckets is smaller than a preset target decrement rate;

and the number of the abnormal events in the time bucket is smaller than a preset minimum threshold for the number of events in a bucket.

5. The method for processing an abnormal event according to claim 3, wherein there are a plurality of target positions, and the establishing a plurality of time buckets in a target cache area according to the total duration of a preset time period comprises:

respectively acquiring preset time periods corresponding to the target positions;

and respectively establishing target cache areas corresponding to the target positions, and respectively establishing a plurality of time buckets in the corresponding target cache areas according to the total duration of the preset time periods.

6. The method of claim 1, wherein the determining an aggregation point in the abnormal event comprises:

obtaining screening conditions of the aggregation point;

and determining, among the plurality of abnormal events, the abnormal event meeting the screening conditions as the aggregation point.

7. The method for processing an abnormal event according to claim 1, wherein the aggregating according to the aggregation point and the abnormal event to obtain an aggregation result comprises:

determining a first target event and a second target event in the abnormal event, wherein the first target event is characterized as a noise event of the aggregation point, and the second target event is characterized as an association event of the aggregation point;

clearing the first target event and retaining the second target event;

and aggregating according to the aggregation point and the second target event to obtain an aggregation result.

8. The method for processing an abnormal event according to claim 1 or 7, wherein the aggregating according to the aggregation point and the abnormal event to obtain an aggregation result comprises:

aggregating the aggregation point and the abnormal event to obtain an aggregation package;

and carrying out root cause identification on the aggregation package, and obtaining root cause identification results of the aggregation points by combining the abnormal events corresponding to the aggregation points.

9. The method of claim 7, wherein determining a first target event and a second target event among the abnormal events comprises:

initializing the abnormal event to obtain initial data, inputting the initial data into a preset bidirectional aggregation model for probability calculation, and respectively obtaining noise probability values of each abnormal event and the corresponding aggregation point;

and determining a first target event and a second target event in the abnormal events according to the noise probability value.

10. The method of claim 9, wherein determining a first target event and a second target event of the abnormal events according to the noise probability value comprises:

acquiring a first probability threshold and a second probability threshold of each aggregation point;

determining the abnormal event corresponding to the noise probability value lower than all the first probability threshold values as a first target event;

and determining the abnormal event corresponding to the noise probability value higher than any one of the second probability thresholds as a second target event.

11. The method for processing an abnormal event according to claim 9, wherein initializing the abnormal event to obtain initial data, and inputting the initial data into a preset bidirectional aggregation model to perform probability calculation, so as to obtain noise probability values of each abnormal event and the corresponding aggregation point, respectively, including:

performing one-hot encoding on the abnormal event to obtain initialized initial vector data;

acquiring a preset bidirectional aggregation model, wherein the bidirectional aggregation model is obtained after unsupervised training according to a sample abnormal event in an acquired sample, a sample target event which is characterized as a noise event and a sample aggregation point;

and inputting the initial vector data into the preset bidirectional aggregation model to perform probability calculation, and respectively obtaining noise probability values of the abnormal events and the corresponding aggregation points.

12. The method for processing an abnormal event according to claim 9, wherein the inputting the initial data into a preset bidirectional aggregation model for probability calculation, respectively obtaining noise probability values of each abnormal event and the corresponding aggregation point, includes:

sorting a plurality of aggregation points according to time and storing the aggregation points in an aggregation point list;

and inputting the initial data into a preset bidirectional aggregation model, and respectively carrying out probability calculation on the abnormal events according to each aggregation point in the aggregation point list to obtain noise probability values of each abnormal event and the corresponding aggregation point.

13. An electronic device, comprising: a memory and a processor, the memory storing a computer program, and the processor implementing the abnormal event processing method according to any one of claims 1 to 12 when executing the computer program.

14. A computer-readable storage medium storing a program that is executed by a processor to implement the abnormal event processing method according to any one of claims 1 to 12.