CN107171819B - Network fault diagnosis method and device

Info

Publication number: CN107171819B
Application number: CN201610128713.5A
Authority: CN
Inventors: 吴俊�; 张亮; 李世昊; 关耀东
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2016-03-07
Filing date: 2016-03-07
Publication date: 2020-02-14
Anticipated expiration: 2036-03-07
Also published as: CN107171819A

Abstract

Embodiments of the present invention provide a network fault diagnosis method and device that locate network faults automatically, without manual intervention, and determine the root cause of a fault, thereby realizing automated fault diagnosis. The method includes: collecting log, alarm, configuration and KPI data of network elements, network management devices and monitoring devices, and detecting abnormal network elements and abnormal information according to the collected data; determining a first fault event according to the abnormal information and a pre-stored correspondence; determining a first fault rule according to the first fault event and a pre-stored fault rule base; and collecting real-time data of the abnormal network elements, confirming the suspected fault events among the first fault events as well as the second fault events, and performing logical calculation according to the confirmation results and the fault rule to determine the root cause of the network system fault. The present invention is applicable to the field of computer technology.

Description

Network fault diagnosis method and device

Technical Field

The present invention relates to the field of computer technology, and in particular to a network fault diagnosis method and device.

Background

With the rapid development of information technology, network systems keep growing in scale and complexity, so the traditional diagnosis method of locating network faults by manually inspecting system logs is no longer applicable.

At present, network faults can be diagnosed by using big data technology to extract relevant features from massive log records and then applying machine learning algorithms to analyze those features statistically, so that faults are detected quickly. Because machine learning algorithms are probabilistic, a detected fault is only a suspected fault, and the logs must still be analyzed manually to confirm it. In addition, a single log entry is limited in length and can therefore record only limited information. For example, a log usually records the events that occur in a service, but more detailed information, such as the real-time status of a network element, is often not recorded. Without such fine-grained information it may be impossible to find the root cause of a network fault. In other words, manual log analysis can only confirm a fault; it cannot guarantee that the root cause of the network fault will be found.

In summary, existing network fault diagnosis methods require manual intervention to confirm a fault, and can only confirm the fault without identifying its root cause.

Summary of the Invention

To this end, embodiments of the present invention provide a network fault diagnosis method and device that locate network faults automatically without manual intervention and determine the root cause of a fault, thereby realizing automated fault diagnosis and improving the efficiency of fault diagnosis.

To achieve the above object, the embodiments of the present invention adopt the following technical solutions:

In a first aspect, a network fault diagnosis method is provided. The method is applied to a network system that includes network elements, a network management device, a monitoring device and a network fault diagnosis device. The method includes:

The network fault diagnosis device acquires log, alarm, configuration and KPI data of the network elements, the network management device and the monitoring device, and detects abnormal network elements and abnormal information according to the acquired data;

The network fault diagnosis device determines a first fault event corresponding to the abnormal information according to the abnormal information and a pre-stored correspondence between abnormal information and fault events;

The network fault diagnosis device determines a first fault rule corresponding to the first fault event according to the first fault event and a pre-stored fault rule base, where the fault rule base includes at least one fault rule, and each fault rule includes at least two fault events and the logical causal relationship between the at least two fault events;

The network fault diagnosis device collects real-time data of the abnormal network element and, according to that real-time data, separately confirms the suspected fault events among the first fault events and the second fault events, where a second fault event is a fault event, among the at least two fault events included in the first fault rule, other than the first fault event; the device then performs logical calculation according to the confirmation results and the first fault rule to determine the root cause of the fault in the network system.

Preferably, before the network fault diagnosis device acquires the log, alarm, configuration and KPI data of the network elements, the network management device and the monitoring device, the method may further include:

The network fault diagnosis device obtains the abnormal information that may occur in the network system and the fault events included in each fault rule in the pre-stored fault rule base;

The network fault diagnosis device abstracts the abnormal information that may occur in the network system and the fault events included in each fault rule in the fault rule base, to obtain the fault behavior corresponding to each piece of abnormal information and the fault behavior corresponding to each fault event;

The network fault diagnosis device establishes and stores the correspondence between abnormal information and fault events according to the fault behavior corresponding to the abnormal information and the abstracted fault behavior corresponding to the fault events.

Preferably, after the network fault diagnosis device performs logical calculation according to the confirmation results of the suspected fault events and the second fault events and the fault rule, and determines the root cause of the fault in the network system, the method may further include:

The network fault diagnosis device generates a corresponding fault recovery script according to the root cause and sends the script to the abnormal network element or the network management device, so that the abnormal network element or the network management device repairs the fault in the network system according to the script.

In this way, once the root cause of a fault is found, a corresponding recovery script is generated for that root cause and sent to the relevant device to repair the fault and restore the network system to normal. Network faults can thus be repaired automatically, without manual intervention.

Preferably, after the network fault diagnosis device separately confirms, according to the real-time data of the abnormal network element, the suspected fault events among the first fault events and the second fault events, the method may further include:

The network fault diagnosis device obtains the fault events confirmed in the current fault diagnosis process and the historical fault events, where the historical fault events are fault events that the network fault diagnosis device confirmed in previous fault diagnosis processes; it mines new fault rules from the currently confirmed fault events and the historical fault events, and stores the new fault rules in the fault rule base.

With this solution, the experience gained from each fault diagnosis is accumulated and used to discover fault rules that the current fault rule base does not cover, which improves the accuracy of fault location and broadens its coverage.

In a second aspect, a network fault diagnosis device is provided. The network fault diagnosis device is applied to a network system that further includes network elements, a network management device and a monitoring device. The network fault diagnosis device includes a data acquisition module, a fault discovery module, an event mapping module and a fault diagnosis module;

The data acquisition module is configured to acquire log, alarm, configuration and KPI data of the network elements, the network management device and the monitoring device;

The fault discovery module is configured to detect abnormal network elements and abnormal information according to the log, alarm, configuration and KPI data;

The event mapping module is configured to obtain the first fault event corresponding to the abnormal information according to the abnormal information and a pre-stored correspondence, where the pre-stored correspondence is the correspondence between abnormal information and fault events;

The fault diagnosis module is configured to determine a first fault rule corresponding to the first fault event according to the first fault event and a pre-stored fault rule base, where the fault rule base includes at least one fault rule, and each fault rule includes at least two fault events and the logical relationship between the at least two fault events;

The data acquisition module is further configured to collect real-time data of the abnormal network element;

The fault diagnosis module is further configured to separately confirm, according to the real-time data of the abnormal network element, the suspected fault events among the first fault events and the second fault events, and to perform logical calculation according to the confirmation results of the suspected fault events and the second fault events and the fault rule, to determine the root cause of the fault in the network system, where a second fault event is a fault event, among the at least two fault events included in the first fault rule, other than the first fault event.

In a third aspect, a network fault diagnosis device is provided, including:

A processor, configured to execute the network fault diagnosis method provided in the first aspect.

Existing network fault diagnosis methods require manual intervention to confirm a fault, and because a single log entry can record only limited information and often omits fine-grained information, the root cause of a network fault may not be found. With the network fault diagnosis method and device provided by the embodiments of the present invention, log, alarm, configuration and KPI data of the relevant devices in the network system are collected; abnormal network elements and abnormal information are detected from the collected data; the first fault event corresponding to each piece of abnormal information is determined from the abnormal information and the pre-stored correspondence; the fault rule is then determined from the first fault event and the pre-stored fault rule base; and real-time data of the abnormal network elements is collected and used to confirm the relevant suspected fault events, while logical calculation is performed according to the fault rule to determine the root cause of the fault in the network system. The correspondence between abnormal information and fault events allows detected abnormal information to be mapped automatically to the relevant fault events. At the same time, real-time data of the relevant network elements is collected for each specific fault event and used to confirm that fault event; logical calculation on the confirmation results and the relevant fault rules then eliminates false alarms and locates the root cause of real faults. It can be seen that the network fault diagnosis method and device provided by the embodiments of the present invention locate network faults automatically without manual intervention and determine the root cause of a fault, thereby realizing automated fault diagnosis and improving the efficiency of fault diagnosis.

Description of Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic architecture diagram of a network system in an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a network fault diagnosis method provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the composition of a fault rule provided by an embodiment of the present invention;

FIG. 4 is a schematic flowchart of another network fault diagnosis method provided by an embodiment of the present invention;

FIG. 5 is a schematic flowchart of still another network fault diagnosis method provided by an embodiment of the present invention;

FIG. 6 is a schematic flowchart of still another network fault diagnosis method provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of a method for establishing a correspondence between abnormal information and fault events provided by an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a network fault diagnosis device provided by an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of another network fault diagnosis device provided by an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of still another network fault diagnosis device provided by an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a network fault diagnosis device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.

It should be noted that, to describe the technical solutions of the embodiments clearly, the following embodiments use words such as "first" and "second" to distinguish identical or similar items whose functions and effects are basically the same. Those skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order.

It should also be noted that the embodiments in the present application, and the features of those embodiments, may be combined with each other where they do not conflict. Those of ordinary skill in the art will understand that the examples shown in the embodiments are schematic illustrations provided to aid understanding and do not limit the present invention.

First, to facilitate understanding of the network fault diagnosis method described below, its application environment, the network system, is briefly introduced as follows:

FIG. 1 is an architecture diagram of the network system. Referring to FIG. 1, the network system 10 includes a network element 101, a network management device 102, a monitoring device 103 and a network fault diagnosis device 104. The network element 101 is a network device or entity that can independently perform one or more functions, such as a router or a switch. The network management device 102 is mainly used to manage the network element 101 comprehensively. For example, it can quickly and automatically discover network elements 101 by means of an algorithm, display the link relationships and running status of network resources in real time, and monitor the core parameters of the network elements 101 in real time, such as the port traffic, port utilization, memory utilization and routing tables of routers and switches, and the running status, startup status, memory, disk, process and service indicators of servers. The monitoring device 103 is mainly used to monitor the application systems of the network and their running status. The network fault diagnosis device 104 executes the network fault diagnosis method described below to diagnose faults in the network system. It may be deployed on the network element 101 or the network management device 102, or it may be a device that is independent of the network element 101 and the network management device 102 and can communicate with the network element 101, the network management device 102 and the monitoring device 103, as shown in FIG. 1. This embodiment of the present invention does not specifically limit this.

Based on the network system 10 shown in FIG. 1, an embodiment of the present invention provides a network fault diagnosis method applied to the network system 10. As shown in FIG. 2, the method includes:

S201. The network fault diagnosis device 104 acquires log, alarm, configuration and key performance indicator (KPI) data of the network element 101, the network management device 102 and the monitoring device 103, and detects abnormal network elements and abnormal information according to the log, alarm, configuration and KPI data.

It should be noted that the data can be obtained by having the network element 101, the network management device 102 and the monitoring device 103 periodically collect log, alarm, configuration and KPI data and actively report the collected data to the network fault diagnosis device 104.

It should also be noted that an abnormal network element is a network element whose log, alarm, configuration or KPI data is abnormal, and the abnormal information describes the abnormality that occurs on the abnormal network element. It may include the name or IP address of the abnormal network element, the network element type, the time at which the abnormality occurred, the corresponding service, and so on. This embodiment of the present invention does not specifically limit this.
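The fields listed for a piece of abnormal information can be pictured as a simple record. The following is only an illustrative sketch: the class name `AnomalyInfo`, its field names, the free-text `symptom` field and the sample values are assumptions, not part of the described embodiment.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AnomalyInfo:
    """One piece of abnormal information about an abnormal network element.

    The layout is hypothetical; the text only lists the kinds of content
    such a record may carry.
    """
    ne_name: str           # name of the abnormal network element
    ne_ip: str             # IP address of the abnormal network element
    ne_type: str           # network element type, e.g. "router"
    occurred_at: datetime  # time at which the abnormality occurred
    service: str           # corresponding service, e.g. "OSPF"
    symptom: str           # assumed free-text description of the abnormality


# Sample record with made-up values (192.0.2.1 is a documentation address).
anomaly = AnomalyInfo(
    ne_name="NE-A",
    ne_ip="192.0.2.1",
    ne_type="router",
    occurred_at=datetime(2016, 3, 7, 12, 0),
    service="OSPF",
    symptom="OSPF traffic drop",
)
```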

Those of ordinary skill in the art will understand that a service in the network system is usually implemented by several network elements 101 working together, and that a single network element 101 may carry several services. Therefore, when one network element 101 in the network system fails, other related network elements 101, or certain parameters of the system, are also affected and become abnormal. For example, when network element A, which implements the Open Shortest Path First (OSPF) routing protocol service, fails, network element B, which cooperates in implementing the OSPF service, may be affected and behave abnormally; at the same time, another service of network element A, the virtual private network (VPN) service, may also be affected and behave abnormally. Therefore, the abnormal network elements detected by the network fault diagnosis device 104 in step S201 usually include multiple network elements, and the abnormal information also includes multiple pieces of information.

S202. The network fault diagnosis device 104 determines the first fault event corresponding to the abnormal information according to the abnormal information and the pre-stored correspondence.

The pre-stored correspondence is the correspondence between abnormal information and the fault events in the fault rules. Through this correspondence, abnormal information can be converted into fault events in the fault rules, which facilitates subsequent fault diagnosis and repair.

Specifically, the correspondence between abnormal information and fault events can be described as Table 1 below, where the first column is the abnormal information and the second column is the corresponding fault event. For example, when the detected abnormal information is a drop in OSPF traffic, the corresponding fault event is OSPF traffic drop.

Table 1

It is easy to understand that, because step S201 usually detects multiple pieces of abnormal information, multiple first fault events are also determined from the abnormal information.
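The mapping in step S202 can be sketched as a table lookup. This is a minimal illustration under stated assumptions: the dictionary `CORRESPONDENCE` and the function `map_to_fault_events` are hypothetical names, only the OSPF row comes from the text, and the VPN row is invented for illustration.

```python
# Hypothetical correspondence table: abnormal information -> fault event.
CORRESPONDENCE = {
    "OSPF traffic drop": "OSPF traffic drop",  # example given in the text
    "VPN traffic drop": "VPN traffic drop",    # assumed row, illustration only
}


def map_to_fault_events(anomalies, correspondence=CORRESPONDENCE):
    """Return the first fault events for the detected abnormal information.

    Order of first appearance is kept, duplicates are dropped, and abnormal
    information without an entry in the table is ignored.
    """
    events = []
    for info in anomalies:
        event = correspondence.get(info)
        if event is not None and event not in events:
            events.append(event)
    return events


first_events = map_to_fault_events(
    ["OSPF traffic drop", "unmapped anomaly", "VPN traffic drop", "OSPF traffic drop"]
)
```

Several pieces of abnormal information yield several first fault events, which matches the observation that step S202 usually produces more than one.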

S203. The network fault diagnosis device 104 determines the first fault rule corresponding to the first fault event according to the first fault event and the pre-stored fault rule base.

The fault rule base includes at least one fault rule, and each fault rule includes the logical causal relationship between at least two fault events. For example, a fault rule may be a fault tree, a decision tree or a Bayesian network. This embodiment of the present invention does not specifically limit this.

Those of ordinary skill in the art will readily understand that a fault event may correspond to one fault rule or to several. For example, a fault event on a network element may appear only in fault tree A, or in both fault tree A and fault tree B. Therefore, step S203 may determine a single fault rule or a group of fault rules. Moreover, because step S202 yields multiple fault events, the fault rules determined in step S203 usually form a group.
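Rule selection in step S203 can be sketched as a membership test over the rule base. The representation below, a dictionary from rule name to the set of fault events in that rule, and the names `select_rules`, `tree_A` and `E1` are assumptions made for illustration.

```python
def select_rules(first_events, rule_base):
    """Return the names of all fault rules that contain a first fault event.

    rule_base maps a rule name to the set of fault events in that rule
    (a hypothetical flat view of a fault tree or similar structure).
    """
    first = set(first_events)
    return [name for name, events in rule_base.items() if first & events]


# Made-up rule base: event E1 appears in two fault trees.
rule_base = {
    "tree_A": {"E1", "E2"},
    "tree_B": {"E1", "E3"},
    "tree_C": {"E4", "E5"},
}
selected = select_rules(["E1"], rule_base)
```

Because `E1` appears in both `tree_A` and `tree_B`, a single first fault event already selects a group of rules.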

S204. The network fault diagnosis device 104 collects real-time data of the abnormal network element, separately confirms, according to that real-time data, the suspected fault events among the first fault events and the second fault events, and performs logical calculation according to the confirmation results of the suspected fault events and the second fault events and the fault rule, to determine the root cause of the fault in the network system.

It should be noted that the second fault event is a fault event other than the first fault event among the N fault events included in the first fault rule, where N is at least two. The real-time data may be the real-time status or performance data of a network element. For a specific fault event to be confirmed, the real-time data can be collected by sending the relevant query command to the corresponding abnormal network element.

It should also be noted that, as those of ordinary skill in the art will understand, the machine learning algorithm used to detect abnormal information is probabilistic and detects abnormal information against a preset sensitivity threshold. If the sensitivity threshold is set too high, faults may be missed; if it is set too low, false alarms may result. A fault event determined from detected abnormal information is therefore a suspected fault. Certain special fault events, such as a drop in service traffic or a rise in hardware memory utilization, cannot be false alarms and are therefore not suspected fault events. In this embodiment of the present invention, some of the first fault events determined from the abnormal information may be false alarms caused by a sensitivity threshold that is too low, so the real-time status or performance data of the network element 101 is needed to confirm whether these fault events are real faults.

It is easy to understand that if step S203 determines only one fault rule, logical calculation is performed with that rule alone, using the confirmation results of the suspected fault events among the first fault events and of the second fault events. If step S203 determines several fault rules, logical calculation is performed with each of them in turn, using the same confirmation results, to determine the root cause of the fault in the network system. It should be noted that, to simplify and speed up diagnosis, logical calculation may be performed only on the fault rules that match a larger number of first fault events. This embodiment of the present invention does not specifically limit this.
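The bookkeeping in step S204 can be sketched as follows: work out which events need confirmation, confirm them against real-time data, and optionally order the rules by how many first fault events they match. All names here are hypothetical, and the `probe` callback merely stands in for the query commands sent to the abnormal network element.

```python
def events_to_confirm(rule_events, first_events, suspected):
    """Suspected first fault events plus all second fault events of a rule."""
    first = set(first_events)
    second = set(rule_events) - first  # events in the rule other than first events
    return (first & set(suspected)) | second


def confirm_events(events, probe):
    """Confirm each event; probe(event) -> bool is an assumed callback that
    checks the real-time data of the abnormal network element."""
    return {event: bool(probe(event)) for event in events}


def rank_rules(first_events, rule_base):
    """Order rule names by the number of first fault events they contain,
    largest first (ties keep their original order)."""
    first = set(first_events)
    return sorted(rule_base, key=lambda name: len(first & rule_base[name]), reverse=True)


rule_base = {"tree_B": {"E1", "E4"}, "tree_A": {"E1", "E2", "E3"}}
first_events = ["E1", "E2"]
pending = events_to_confirm(rule_base["tree_A"], first_events, suspected={"E2"})
results = confirm_events(sorted(pending), probe=lambda event: event == "E3")
order = rank_rules(first_events, rule_base)
```

In this made-up run, `E1` is treated as not suspected and needs no confirmation, while `E2` (suspected) and `E3` (a second fault event) are both checked.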

Specifically, if a fault tree is used as the fault rule, the logical calculation can proceed top-down or bottom-up.

As an example, the top-down approach is used below, together with the fault tree shown in FIG. 3, to briefly describe the logical calculation:

Starting from the top event T1, search downward for the events T2, T3 and T8 that are associated with T1 through logic gates. If T2, T3 and T8 are found (that is, T2, T3 and T8 are all confirmed to hold), continue searching downward for the other fault events associated with T2, T3 and T8 through logic gates, until the bottom events are reached. All bottom events found are the root causes of the fault. For example, if the downward search finds events T4, T5 and T7, the root causes are T4, T5 and T7 together with event T8 found in the previous step. A bottom event is an event with which no other fault event is associated through a logic gate. Event T4 in FIG. 3 is an example: no other event below T4 is associated with it through a logic gate, so T4 is a bottom event.
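The top-down search described above can be sketched as a traversal that follows only confirmed events and collects the bottom events it reaches. This is a simplified illustration, not the embodiment itself: FIG. 3 is not reproduced here, so the tree shape below (including event T6 and which events hang under T2 and T3) is an assumption chosen to match the worked example, and the gate types (AND/OR) are ignored.

```python
def find_root_causes(tree, top, confirmed):
    """Top-down search for root causes in a fault tree.

    tree maps an event to the events linked beneath it through a logic gate;
    an event with no entry (or an empty list) is a bottom event.
    confirmed is the set of events confirmed to hold.
    Simplification: gate types are not modelled, so every confirmed child
    is followed.
    """
    if top not in confirmed:
        return []
    roots, stack, seen = [], [top], set()
    while stack:
        event = stack.pop()
        if event in seen:
            continue
        seen.add(event)
        children = tree.get(event, [])
        if not children:
            roots.append(event)  # bottom event reached: a root cause
            continue
        stack.extend(child for child in children if child in confirmed)
    return sorted(roots)


# Assumed tree shape; T6 is a hypothetical event that is not confirmed.
fault_tree = {"T1": ["T2", "T3", "T8"], "T2": ["T4", "T5"], "T3": ["T6", "T7"]}
confirmed = {"T1", "T2", "T3", "T8", "T4", "T5", "T7"}
root_causes = find_root_causes(fault_tree, "T1", confirmed)
```

With these inputs the search returns T4, T5, T7 and T8, the same set named in the worked example. A fuller version would evaluate each gate, for instance requiring all children of an AND gate to be confirmed.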

Specifically, as shown in FIG. 4, in the network fault diagnosis method provided by this embodiment of the present invention, the step in which the network fault diagnosis apparatus 104 detects abnormal network elements and abnormal information according to the log, alarm, configuration and KPI data (that is, S202) may specifically include:

S202a. The network fault diagnosis apparatus 104 parses the log, alarm, configuration and KPI data into structured data and extracts feature values of the structured data.

S202b. The network fault diagnosis apparatus 104 uses a machine learning algorithm to perform feature statistics on the feature values of the structured data and obtains a statistical result.

S202c. The network fault diagnosis apparatus 104 determines the abnormal network elements and the abnormal information according to the statistical result.

By way of example, two features of the collected data, frequency and periodicity, may be extracted and statistically analyzed with a suitable machine learning algorithm to determine the abnormal network elements and the abnormal information. The rationale is that data occurring less frequently and less periodically is more likely to be fault-related.
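The frequency and periodicity heuristic can be illustrated with a small sketch. The embodiment does not name a specific machine learning algorithm, so the threshold-based scoring below is a stand-in and not the patented method; the record layout `(timestamp, template)`, the function names and the thresholds `max_freq` and `max_period` are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def periodicity(timestamps):
    """Score in [0, 1]; 1.0 means perfectly regular gaps between occurrences."""
    if len(timestamps) < 3:
        return 0.0  # too few occurrences to exhibit any period
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    avg = mean(gaps)
    if avg == 0:
        return 0.0
    cv = pstdev(gaps) / avg  # coefficient of variation of the gaps
    return max(0.0, 1.0 - cv)


def suspicious_templates(records, max_freq=0.2, max_period=0.5):
    """Flag message templates that are both infrequent and weakly periodic."""
    by_template = defaultdict(list)
    for ts, template in records:
        by_template[template].append(ts)
    total = len(records)
    flagged = []
    for template, stamps in by_template.items():
        freq = len(stamps) / total
        if freq <= max_freq and periodicity(stamps) <= max_period:
            flagged.append(template)
    return sorted(flagged)


# Structured records: a heartbeat every 10 s plus two irregular neighbor-down messages.
records = [(t, "heartbeat") for t in range(0, 100, 10)]
records += [(37, "OSPF_Nbr_Down"), (41, "OSPF_Nbr_Down")]
flagged = suspicious_templates(records)
```

Here the frequent, regular heartbeat is ignored, while the rare, irregular message is flagged as potentially fault-related.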

Further, as shown in FIG. 5, after the network fault diagnosis apparatus 104 performs the logical calculation according to the suspected fault event, the confirmation result of the second fault event and the fault rule and determines the root cause of the network system fault, the network fault diagnosis method provided by this embodiment of the present invention may further include:

S205. The network fault diagnosis apparatus 104 generates a corresponding fault recovery script according to the root cause and sends the fault recovery script to the abnormal network element or the network management device 102, so that the abnormal network element or the network management device 102 repairs the network system fault according to the fault recovery script.

That is, once the root cause of the fault is found, a corresponding recovery script is generated for that root cause and sent to the relevant device to repair the fault and restore the network system to normal.

Preferably, after the network fault diagnosis apparatus 104 confirms, according to the real-time data of the abnormal network element, the suspected fault event in the first fault event and the second fault event respectively, the network fault diagnosis method provided by this embodiment of the present invention may further include:

The network fault diagnosis apparatus 104 obtains the fault events confirmed in the current fault diagnosis process and the historical fault events, mines new fault rules from them, and stores the new fault rules in the fault rule base.

Here, historical fault events are fault events confirmed in previous fault diagnosis processes. A person of ordinary skill in the art will readily understand that a new fault rule is a fault rule not covered by the fault rule base.

In one possible implementation of this embodiment of the present invention, the fault events confirmed in each fault diagnosis process may be stored in a database to form a historical fault event base. When new fault rules are mined, the confirmed fault events can then be read directly from the historical fault event base.
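The embodiment does not specify how new fault rules are mined. As one hedged illustration, the sketch below uses simple co-occurrence counting: pairs of fault events that were confirmed together in at least `min_support` diagnosis sessions, and that are not yet in the rule base, become candidate rules. The function name, the parameters and the event labels E1 to E3 are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_candidate_rules(sessions, known_pairs, min_support=2):
    """Return event pairs seen together in >= min_support sessions and not yet known."""
    counts = Counter()
    for events in sessions:
        # Count each unordered pair once per diagnosis session.
        for pair in combinations(sorted(set(events)), 2):
            counts[pair] += 1
    known = {tuple(sorted(p)) for p in known_pairs}
    return sorted(p for p, n in counts.items() if n >= min_support and p not in known)


# Historical fault event base plus the events confirmed in the current diagnosis.
history = [{"E1", "E2"}, {"E2", "E3"}]
current = {"E1", "E2", "E3"}
candidates = mine_candidate_rules(history + [current], known_pairs=[("E2", "E1")])
```

A candidate pair only indicates that two events tend to occur together; deciding the direction of the causal relationship would need additional information, such as event timestamps.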

In the prior art, a machine does not understand the semantics of information. Unlike a technician, it cannot infer the causal relationships between pieces of fault information from their semantics, accumulate experience as the number of diagnoses grows, and generalize that experience into fault rules for later diagnoses. Existing network fault diagnosis methods therefore often fail to make full use of the experience gained during fault diagnosis. By contrast, the network fault diagnosis method provided by this embodiment of the present invention accumulates the experience of each fault diagnosis and uses it to discover fault rules not covered by the current fault rule base, which can improve the accuracy of fault location and broaden its coverage.

Preferably, as shown in FIG. 6, before the network fault diagnosis apparatus 104 collects the log, alarm, configuration and KPI data of the network element 101, the network management device 102 and the monitoring device 103, the network fault diagnosis method provided by this embodiment of the present invention may further include:

S206. The network fault diagnosis apparatus 104 obtains the abnormal information that may occur in the network system and all fault events included in all fault rules in the pre-stored fault rule base.

S207. The network fault diagnosis apparatus 104 abstracts the abnormal information that may occur in the network system to obtain the fault behavior corresponding to the abnormal information, and abstracts the fault events included in each fault rule in the pre-stored fault rule base to obtain the fault behavior corresponding to each fault event.

S208. The network fault diagnosis apparatus 104 establishes and stores the correspondence between the abnormal information and the fault events according to the fault behavior corresponding to the abnormal information and the abstracted fault behavior corresponding to the fault events.

For example, abnormal information and fault events may be abstracted into the following three types of fault behavior: (1) Service: the service functions exhibited by a network element, for example, a VPN service or an OSPF routing protocol service; (2) System: the non-service functions of a network element, specifically the basic functions that use the underlying hardware to support upper-layer services, such as alarm management and clock management; (3) Hardware: the physical components of a network element, such as the central processing unit (CPU), network ports and the main control board. Further, each of these types of fault behavior may be abstracted into the following three subtypes: (1) Performance: for example, service traffic drops or hardware CPU usage rises; (2) Event: for example, a service protocol flaps or the system clock source is lost; (3) Configuration: for example, VPN service encapsulation types are inconsistent. In this way, with fault behavior as the intermediary, a mapping between abnormal information and fault events, that is, the correspondence between abnormal information and fault events, can be established.

By way of example, referring to FIG. 7, assume that the abnormal information that may occur in the network system includes: the OSPF traffic drop exceeds a threshold, OSPF_Nbr_UP or OSPF_Nbr_Down occurs frequently, and the encapsulation types configured on two network elements are inconsistent; and that the fault rules in the fault rule base include the following three fault events: OSPF traffic drop, protocol flapping, and inconsistent neighbor configuration. By abstracting the abnormal information and the fault events into the fault behaviors shown in the middle of FIG. 7, the correspondence between the abnormal information and the fault events can be established through the fault behaviors, as shown in Table 1.
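Steps S206 to S208 amount to joining abnormal information and fault events on a shared fault behavior. The sketch below illustrates this with the OSPF example; FIG. 7 and Table 1 are not reproduced here, so the specific (type, subtype) behaviors assigned to each item are assumptions, as are the function and variable names.

```python
def build_correspondence(anomaly_behaviors, event_behaviors):
    """Map each piece of abnormal information to the fault events sharing its behavior."""
    events_by_behavior = {}
    for event, behavior in event_behaviors.items():
        events_by_behavior.setdefault(behavior, []).append(event)
    return {
        anomaly: sorted(events_by_behavior.get(behavior, []))
        for anomaly, behavior in anomaly_behaviors.items()
    }


# Fault behavior is modelled as a (type, subtype) pair; the assignments are assumed.
anomaly_behaviors = {
    "OSPF traffic drop exceeds threshold": ("service", "performance"),
    "OSPF_Nbr_UP or OSPF_Nbr_Down occurs frequently": ("service", "event"),
    "encapsulation types on two network elements are inconsistent": ("service", "configuration"),
}
event_behaviors = {
    "OSPF traffic drop": ("service", "performance"),
    "protocol flapping": ("service", "event"),
    "inconsistent neighbor configuration": ("service", "configuration"),
}
correspondence = build_correspondence(anomaly_behaviors, event_behaviors)
```

A real behavior key would likely also identify the affected object (for example, the OSPF service rather than a VPN service), so that unrelated items with the same type and subtype are not joined.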

Existing network fault diagnosis methods require manual intervention to confirm faults. In addition, because a single log can record only limited information and often omits fine-grained details, the root cause of a network fault may not be found. The network fault diagnosis method provided by this embodiment of the present invention collects the log, alarm, configuration and KPI data of the relevant devices in the network system, detects abnormal network elements and abnormal information from the collected data, and determines the first fault event corresponding to each piece of abnormal information according to the abnormal information and the pre-stored correspondence. It then determines the fault rule according to the first fault event and the pre-stored fault rule base, collects real-time data of the abnormal network elements, uses that data to confirm the relevant suspected fault events, and performs the logical calculation according to the fault rule to determine the root cause of the network system fault. The detected abnormal information is mapped automatically to the relevant fault events through the correspondence between abnormal information and fault events. Real-time data of the relevant network elements is collected for the specific fault events and used to confirm them. The logical calculation based on the confirmation results and the relevant fault rules then eliminates false alarms and locates the root cause of real faults. It can be seen that the network fault diagnosis method provided by this embodiment of the present invention locates network faults automatically without manual intervention and determines the root cause of the fault, thereby realizing automated fault diagnosis and improving fault diagnosis efficiency.

Based on the above method, an embodiment of the present invention provides a network fault diagnosis apparatus 104, applied to the network system 10 shown in FIG. 1. As shown in FIG. 8, the apparatus includes a data acquisition module 1041, a fault discovery module 1042, an event mapping module 1043 and a fault diagnosis module 1044.

The data acquisition module 1041 is configured to collect the log, alarm, configuration and KPI data of the network element 101, the network management device 102 and the monitoring device 103.

The fault discovery module 1042 is configured to detect abnormal network elements and abnormal information according to the log, alarm, configuration and KPI data.

The event mapping module 1043 is configured to determine the first fault event corresponding to the abnormal information according to the abnormal information and the pre-stored correspondence.

The fault diagnosis module 1044 is configured to determine the fault rule corresponding to the first fault event according to the first fault event and the pre-stored fault rule base.

The data acquisition module 1041 is further configured to collect real-time data of the abnormal network elements.

The fault diagnosis module 1044 is further configured to confirm, according to the real-time data of the abnormal network elements, the suspected fault event in the first fault event and the second fault event respectively, and to perform the logical calculation according to the suspected fault event, the confirmation result of the second fault event and the fault rule to determine the root cause of the network system fault.

Here, the correspondence is the correspondence between abnormal information and fault events; the fault rule base includes at least one fault rule, and each fault rule includes at least two fault events and the logical causal relationship between the at least two fault events; the second fault event is a fault event, other than the first fault event, among the at least two fault events included in the first fault rule.
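The definition of the second fault event is a set difference, which can be sketched as follows. This is only an illustration, assuming a fault rule is represented by the list of fault events it contains; the function name and the sample events are hypothetical.

```python
def second_fault_events(rule_events, first_fault_events):
    """Events of the matched rule that were not already raised as first fault events."""
    if len(set(rule_events)) < 2:
        # Each fault rule is defined as containing at least two fault events.
        raise ValueError("a fault rule must contain at least two fault events")
    return sorted(set(rule_events) - set(first_fault_events))


rule = ["OSPF traffic drop", "protocol flapping", "inconsistent neighbor configuration"]
to_confirm = second_fault_events(rule, ["OSPF traffic drop"])
```

The returned events are the ones for which the apparatus would still need real-time data, since they were not derived from the detected abnormal information.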

Specifically, in the network fault diagnosis apparatus 104 provided by this embodiment of the present invention, the fault discovery module 1042 may be specifically configured to:

parse the log, alarm, configuration and KPI data into structured data, and extract feature values of the structured data;

use a machine learning algorithm to perform feature statistics on the feature values of the structured data and obtain a statistical result;

determine the abnormal network elements and the abnormal information according to the statistical result.

Further, as shown in FIG. 9, the network fault diagnosis apparatus 104 provided by this embodiment of the present invention may further include a policy generation module 1045.

The policy generation module 1045 is configured to, after the fault diagnosis module 1044 performs the logical calculation according to the suspected fault event, the confirmation result of the second fault event and the fault rule and determines the root cause of the network system fault, generate a corresponding fault recovery script according to the root cause and send the fault recovery script to the abnormal network element or the network management device 102, so that the abnormal network element or the network management device 102 repairs the network system fault according to the fault recovery script.

Preferably, as shown in FIG. 10, the network fault diagnosis apparatus 104 provided by this embodiment of the present invention may further include a fault rule mining module 1046.

The fault rule mining module 1046 is configured to, after the fault diagnosis module 1044 confirms, according to the real-time data of the abnormal network elements, the suspected fault event in the first fault event and the second fault event respectively, obtain the fault events confirmed by the fault diagnosis module 1044 in the current fault diagnosis process and the historical fault events, mine new fault rules from them, and store the new fault rules in the fault rule base.

Here, historical fault events are fault events confirmed by the fault diagnosis module 1044 in previous fault diagnosis processes.

Preferably, in the network fault diagnosis apparatus 104 provided by this embodiment of the present invention, the event mapping module 1043 may be further configured to:

before the data acquisition module 1041 collects the log, alarm, configuration and KPI data of the network element, the network management device 102 and the monitoring device 103, obtain the abnormal information that may occur in the network system and the fault events included in each fault rule in the pre-stored fault rule base;

abstract the abnormal information that may occur in the network system to obtain the fault behavior corresponding to the abnormal information, and abstract all fault events included in all fault rules in the pre-stored fault rule base to obtain the fault behavior corresponding to each fault event;

establish and store the correspondence between the abnormal information and the fault events according to the fault behavior corresponding to the abnormal information and the abstracted fault behavior corresponding to the fault events.

Because the network fault diagnosis apparatus 104 provided by this embodiment can be used to execute the above network fault diagnosis method, the technical effects it can achieve are described in the above method embodiments and are not repeated here.

In addition, an embodiment of the present invention further provides a network fault diagnosis apparatus. As shown in FIG. 11, the network fault diagnosis apparatus 110 includes a processor 1101.

The processor 1101 is configured to execute the network fault diagnosis method provided by the embodiments of the present invention.

Because the network fault diagnosis apparatus 110 in this embodiment can be used to execute the above network fault diagnosis method, the technical effects it can achieve are described in the above method embodiments and are not repeated here.

In addition, an embodiment of the present invention further provides a computer-readable medium (or media) including computer-readable instructions that, when executed, perform the operations of the network fault diagnosis apparatus 110 in the above method embodiments.

A computer program product including the above computer-readable medium is also provided.

It should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present invention.

A person skilled in the art will clearly understand that, for convenience and brevity of description, the apparatus described above is illustrated only by the above division into functional modules. In practice, the functions may be assigned to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. For the specific working processes of the systems, apparatuses and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A network fault diagnosis method is applied to a network system, the network system comprises a network element, network management equipment, monitoring equipment and a network fault diagnosis device, and the method comprises the following steps:

the network fault diagnosis device acquires logs, alarms, configurations and Key Performance Indicator (KPI) data of the network elements, the network management equipment and the monitoring equipment, and detects abnormal network elements and abnormal information according to the logs, the alarms, the configurations and the KPI data;

the network fault diagnosis device determines a first fault event corresponding to the abnormal information according to the abnormal information and a pre-stored corresponding relationship, wherein the pre-stored corresponding relationship is the corresponding relationship between the abnormal information and the fault event;

the network fault diagnosis device determines a first fault rule corresponding to the first fault event according to the first fault event and a prestored fault rule base; wherein the fault rule base comprises at least one fault rule, each fault rule comprising at least two fault events and a logical causal relationship between the at least two fault events;

the network fault diagnosis device acquires real-time data of the abnormal network element, respectively confirms a suspected fault event in the first fault event and a second fault event according to the real-time data of the abnormal network element, and performs logic calculation according to the suspected fault event, the confirmation result of the second fault event and the first fault rule to determine a root cause causing the network system to have a fault; wherein the second fault event is a fault event other than the first fault event in the at least two fault events included in the first fault rule.

2. The method of claim 1, wherein the network fault diagnosis device detects abnormal network elements and abnormal information according to the log, alarm, configuration and KPI data, and comprises:

the network fault diagnosis device analyzes the log, the alarm, the configuration and the KPI data into structured data and extracts characteristic values of the structured data;

the network fault diagnosis device carries out feature statistics on the feature values of the structured data by utilizing a machine learning algorithm to obtain a statistical result;

and the network fault diagnosis device determines abnormal network elements and abnormal information according to the statistical result.

3. The method according to claim 1 or 2, wherein after the network fault diagnosis device performs logical calculation according to the suspected fault event, the confirmation result of the second fault event and the fault rule, and determines a root cause causing the network system to fail, the method further comprises:

and the network fault diagnosis device generates a corresponding fault recovery script according to the root cause and sends the fault recovery script to the abnormal network element or the network management equipment, so that the abnormal network element or the network management equipment repairs the fault of the network system according to the fault recovery script.

4. The method according to claim 1 or 2, wherein after the network fault diagnosis device respectively confirms the suspected fault event and the second fault event in the first fault event according to the real-time data of the abnormal network element, the method further comprises:

the network fault diagnosis device acquires a fault event confirmed in the current fault diagnosis process and a historical fault event, mines a new fault rule according to the fault event confirmed in the current fault diagnosis process and the historical fault event, and stores the new fault rule into the fault rule base; wherein the historical fault event is a fault event confirmed by the network fault diagnosis device in a previous fault diagnosis process.

5. The method according to claim 3, wherein after the network fault diagnosis device respectively confirms the suspected fault event and the second fault event in the first fault event according to the real-time data of the abnormal network element, the method further comprises:

6. The method according to claim 1, 2 or 5, wherein before the network fault diagnosis device collects log, alarm, configuration and KPI data of the network elements, the network management equipment and the monitoring equipment, the method further comprises:

the network fault diagnosis device acquires abnormal information which may appear in the network system and fault events included in each fault rule in the prestored fault rule base;

the network fault diagnosis device abstracts abnormal information which may appear in the network system to obtain fault behaviors corresponding to the abnormal information, and abstracts fault events included in each fault rule in the prestored fault rule base to obtain fault behaviors corresponding to the fault events;

and the network fault diagnosis device establishes and stores the corresponding relation between the abnormal information and the fault event according to the fault behavior corresponding to the abnormal information and the abstract behavior corresponding to the fault event.

7. The method according to claim 3, wherein before the network fault diagnosis apparatus collects log, alarm, configuration and KPI data of the network element, the network management equipment and the monitoring equipment, the method further comprises:

8. The method according to claim 4, wherein before the network fault diagnosis apparatus collects log, alarm, configuration and KPI data of the network element, the network management equipment and the monitoring equipment, the method further comprises:

9. A network fault diagnosis device is characterized in that the network fault diagnosis device is applied to a network system, the network system further comprises a network element, network management equipment and monitoring equipment, and the network fault diagnosis device comprises: the system comprises a data acquisition module, a fault discovery module, an event mapping module and a fault diagnosis module;

the data acquisition module is used for acquiring logs, alarms, configuration and Key Performance Indicator (KPI) data of the network element, the network management equipment and the monitoring equipment;

the fault discovery module is used for detecting abnormal network elements and abnormal information according to the log, the alarm, the configuration and the KPI data;

the event mapping module is used for obtaining a first fault event corresponding to the abnormal information according to the abnormal information and a pre-stored corresponding relationship, wherein the pre-stored corresponding relationship is the corresponding relationship between the abnormal information and the fault event;

the fault confirmation module is used for determining a first fault rule corresponding to the first fault event according to the first fault event and a prestored fault rule base; wherein the fault rule base comprises at least one fault rule, each fault rule comprising at least two fault events and a logical relationship between the at least two fault events;

the data acquisition module is further used for acquiring real-time data of the abnormal network element;

the fault confirmation module is further configured to respectively confirm, according to the real-time data of the abnormal network element, a suspected fault event in the first fault event and a second fault event, perform logical calculation according to the confirmation results of the suspected fault event and the second fault event and the fault rule, and determine a root cause of the fault in the network system, wherein the second fault event is a fault event, other than the first fault event, among the at least two fault events included in the first fault rule.
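As an illustrative, non-limiting sketch of the rule selection, confirmation and logical calculation recited above, a fault rule may be modeled as a set of fault events, a logical relationship and a root cause. The `FaultRule` class, the "AND"/"OR" encoding of the logical relationship, the `confirm` callback standing in for the check against real-time data, and all event names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRule:
    events: tuple    # at least two fault events
    logic: str       # assumed encoding of the logical relationship: "AND" or "OR"
    root_cause: str


def select_rules(first_events, rule_base):
    """Return the fault rules that contain at least one first fault event."""
    wanted = set(first_events)
    return [rule for rule in rule_base if wanted & set(rule.events)]


def diagnose(first_events, rule_base, confirm):
    """confirm(event) -> bool is a stand-in for confirming an event
    against real-time data of the abnormal network element."""
    root_causes = []
    for rule in select_rules(first_events, rule_base):
        # Confirm the suspected (first) events and the remaining (second) events.
        results = [confirm(event) for event in rule.events]
        holds = all(results) if rule.logic == "AND" else any(results)
        if holds:
            root_causes.append(rule.root_cause)
    return root_causes


# Hypothetical rule base and real-time confirmation results.
RULE_BASE = [
    FaultRule(("port_down", "optical_power_low"), "AND", "optical module degraded"),
    FaultRule(("port_down", "config_shutdown"), "AND", "port administratively disabled"),
]
REAL_TIME = {"port_down": True, "optical_power_low": True, "config_shutdown": False}

ROOT_CAUSES = diagnose(["port_down"], RULE_BASE, lambda e: REAL_TIME.get(e, False))
```

Here both rules are selected because each contains the first fault event `port_down`, but only the rule whose second fault event is also confirmed yields a root cause.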

10. The apparatus of claim 9, wherein the fault discovery module is specifically configured to:

parsing the log, the alarm, the configuration and the KPI data into structured data, and extracting characteristic values of the structured data;

performing characteristic statistics on the characteristic value of the structured data by using a machine learning algorithm to obtain a statistical result;

and determining abnormal network elements and abnormal information according to the statistical result.
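As an illustrative, non-limiting sketch of the detection steps recited above, the following assumes the collected data has already been parsed into structured records. The claim does not name a particular machine learning algorithm, so a simple z-score statistic is substituted here purely for illustration. The record schema (`ne`, `metric`, `value`), the threshold and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def detect_abnormal(records, threshold=2.0):
    """records: structured data as dicts with 'ne', 'metric' and 'value' keys
    (assumed schema). Returns (network element, abnormal information) pairs."""
    # Group the characteristic values by metric.
    by_metric = {}
    for record in records:
        by_metric.setdefault(record["metric"], []).append(record)

    abnormal = []
    for metric, rows in by_metric.items():
        values = [row["value"] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # no spread across network elements, nothing stands out
        for row in rows:
            z = (row["value"] - mu) / sigma
            if abs(z) > threshold:
                abnormal.append((row["ne"], f"{metric}={row['value']}"))
    return abnormal


# Hypothetical KPI samples from six network elements.
RECORDS = [
    {"ne": f"NE{i}", "metric": "cpu_usage", "value": v}
    for i, v in enumerate([10, 11, 9, 10, 10, 60], start=1)
]

ABNORMAL = detect_abnormal(RECORDS)
```

With few samples a z-score cannot grow large, so the threshold in this sketch is deliberately low; a practical system would likely use a more robust statistic or a trained model.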

11. The apparatus of claim 9 or 10, further comprising: a policy generation module;

the policy generation module is configured to, after the fault confirmation module performs logical calculation according to the suspected fault event, the confirmation result of the second fault event, and the fault rule, and determines a root cause of the fault in the network system, generate a corresponding fault recovery script according to the root cause, and send the fault recovery script to the abnormal network element or the network management equipment, so that the abnormal network element or the network management equipment repairs the fault in the network system according to the fault recovery script.

12. The apparatus of claim 9 or 10, further comprising: a fault rule mining module;

the fault rule mining module is configured to, after the fault confirmation module respectively confirms the suspected fault event in the first fault event and the second fault event according to the real-time data of the abnormal network element, acquire the fault event confirmed by the fault confirmation module in the current fault diagnosis process and the historical fault event, mine a new fault rule according to the fault event confirmed in the current fault diagnosis process and the historical fault event, and store the new fault rule in the fault rule base;

wherein the historical fault event is a fault event confirmed by the fault confirmation module in a previous fault diagnosis process.
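As an illustrative, non-limiting sketch of the rule mining recited above, candidate rules may be derived by counting how often pairs of confirmed fault events occur together across the current and previous diagnosis processes. The claim does not specify a mining algorithm, so frequent-pair counting is an assumption, as are the function name `mine_rules`, the `min_support` parameter and the sample events.

```python
from collections import Counter
from itertools import combinations


def mine_rules(current_events, history, min_support=2):
    """Return pairs of fault events confirmed together in at least
    min_support diagnosis processes (current one included)."""
    sessions = list(history) + [current_events]
    pair_counts = Counter()
    for events in sessions:
        # Count each unordered pair once per diagnosis process.
        for pair in combinations(sorted(set(events)), 2):
            pair_counts[pair] += 1
    return [pair for pair, count in sorted(pair_counts.items()) if count >= min_support]


# Hypothetical confirmed fault events from previous diagnosis processes.
HISTORY = [
    ["port_down", "optical_power_low"],
    ["port_down", "config_shutdown"],
]
# Hypothetical fault events confirmed in the current diagnosis process.
CURRENT_EVENTS = ["optical_power_low", "port_down", "crc_errors"]

NEW_RULES = mine_rules(CURRENT_EVENTS, HISTORY)
```

Each returned pair is only a candidate: in this sketch it would still need a logical relationship and a root cause attached before being stored in the fault rule base.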

13. The apparatus of claim 11, further comprising: a fault rule mining module;

14. The apparatus of claim 9, 10 or 13, wherein the event mapping module is further configured to:

before the data acquisition module acquires the log, alarm, configuration and KPI data of the network element, the network management equipment and the monitoring equipment, acquiring possible abnormal information of the network system and fault events included in each fault rule in the prestored fault rule base;

abstracting abnormal information which may appear in the network system to obtain a fault behavior corresponding to the abnormal information, and abstracting a fault event included in each fault rule in the prestored fault rule base to obtain a fault behavior corresponding to the fault event;

and establishing and storing a corresponding relation between the abnormal information and the fault event according to the fault behavior corresponding to the abnormal information and the fault behavior corresponding to the fault event.

15. The apparatus of claim 11, wherein the event mapping module is further configured to:

16. The apparatus of claim 12, wherein the event mapping module is further configured to: