CN112231523B - Network fault positioning and troubleshooting method and system based on directed acyclic graph - Google Patents
- Publication number
- CN112231523B (application CN202011124262.0A)
- Authority
- CN
- China
- Prior art keywords
- fault
- loop
- equipment
- directed acyclic
- acyclic graph
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
- Y04S10/52—Outage or fault management, e.g. fault detection or location
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention relates to a method and system for locating and troubleshooting network faults based on a directed acyclic graph. The method includes: obtaining a computer network system and its component devices; drawing a directed acyclic graph from the computer network system and its component devices; obtaining from the directed acyclic graph the faulty path on which the network fault occurs; selecting a troubleshooting method for the faulty path according to implementation difficulty or contribution degree; locating the fault on the faulty path using the troubleshooting method; determining whether the fault lies in the device under test; if so, displaying the preset fault solution for the device under test; if not, using the troubleshooting method to locate the fault on the upstream Internet path and the downstream device-terminal path, until the faulty node on the upstream Internet path or the downstream device-terminal path is found and its preset fault solution is displayed. The invention locates network faults quickly and reduces time and labor costs.
Description
Technical field
The invention relates to operation and maintenance in the computer network industry, and in particular to a method and system for locating and troubleshooting network faults based on a directed acyclic graph.
Background
Networks are increasingly important to modern enterprise production. Operating and managing production equipment, internal communication, and external presentation all require a reliable network connection. However, faults with physical causes (equipment aging, replacement, etc.) and logical causes (configuration errors, IP address changes, etc.) are hard to avoid. Moreover, as network structures become more complex, the number of paths on which a fault may occur grows non-linearly with the number of devices, making faults increasingly difficult to locate.
The prior art generally relies on monitoring tools and expert experience. Monitoring tools (such as Zabbix) can alert operation and maintenance staff as soon as a fault occurs. However, because a fault in one device often causes data anomalies in its upstream and downstream devices as well, an "alarm storm" (a large number of devices raising alarms at the same time) results, which does not help solve the problem. Employees outside the operation and maintenance department lack both monitoring tools and operations experience; when they cannot get online, they can only try to fix the problem themselves and report it if that fails, which is inefficient.
An experienced expert first clarifies the upstream and downstream relationships between devices and, based on the current fault symptoms, uses the most effective test to locate the problem device quickly. The expert then draws on the history of previously handled cases to determine the most likely cause of the fault (improper configuration, equipment aging, a loose port) and resolves the fault with the lowest-cost solution.
Ideally, 1) changes in network topology (devices added or removed, how devices are connected, etc.) would be synchronized to every expert in real time, and 2) when a fault occurs, an expert would arrive on site immediately and handle the fault in a stable mental and physical state. Losses caused by network faults would then be minimized. However, the labor cost of achieving this is clearly more than an enterprise can bear.
Summary of the invention
The purpose of the invention is to provide a method and system for locating and troubleshooting network faults based on a directed acyclic graph, which can locate a fault quickly, propose a solution, and reduce the time and labor cost of network maintenance.
To achieve this purpose, the invention provides the following solution:
A method for locating and troubleshooting network faults based on a directed acyclic graph, including:
obtaining a computer network system and its component devices;
drawing a directed acyclic graph from the obtained computer network system and its component devices;
obtaining, from the directed acyclic graph, the faulty path on which the network fault occurs;
selecting a troubleshooting method for the faulty path according to implementation difficulty or contribution degree, the troubleshooting methods being divided by detection direction into bypassing the device under test, the upstream Internet path, and the downstream device-terminal path;
locating the fault on the faulty path using the troubleshooting method;
determining whether the fault lies in the device under test;
if so, displaying the preset fault solution for the device under test;
if not, using the troubleshooting method to locate the fault on the upstream Internet path and the downstream device-terminal path;
until the faulty node on the upstream Internet path or the downstream device-terminal path is found and the preset fault solution for that node is displayed.
Optionally, drawing the directed acyclic graph from the obtained computer network system and its component devices includes:
representing the component devices as nodes and the connections between the component devices as edges.
Optionally, the troubleshooting methods include:
an upstream Internet path check, a downstream device-terminal path check, and a bypass check that bypasses the device under test.
Optionally, the upstream Internet path check includes:
obtaining the node of the device under test and its sibling nodes in the directed acyclic graph;
sending ping packets toward the upstream Internet from the device under test and from the devices at its sibling nodes;
if the device under test cannot connect to the Internet but its sibling nodes can, determining that the fault is in the device under test;
if none of the sibling nodes can connect, determining that the fault is on the upstream Internet path.
Optionally, the downstream device-terminal path check includes:
sending a ping packet from the downstream device terminal to the device under test;
if the downstream device terminal can reach the device under test, the fault is on the upstream Internet path;
if the downstream device terminal cannot reach the device under test, the fault is on the downstream device-terminal path.
Optionally, the bypass check includes:
sending a ping packet from the downstream device terminal to the upstream Internet over a connection that bypasses the device under test;
if the downstream device terminal can reach the upstream Internet, the fault is in the device under test;
if the downstream device terminal cannot reach the upstream Internet, the fault is on the upstream Internet path or the downstream device-terminal path.
Optionally, the greater the implementation difficulty, the larger the corresponding preset value.
Optionally, the greater the contribution degree, the larger the corresponding preset value.
Optionally, selecting the troubleshooting method for the faulty path according to implementation difficulty or contribution degree includes:
determining whether the preset implementation-difficulty values of the troubleshooting methods are equal;
if not, selecting the troubleshooting method with the smaller preset implementation-difficulty value;
if so, selecting the troubleshooting method with the larger preset contribution value.
A system for locating and troubleshooting network faults based on a directed acyclic graph, including:
a device acquisition module, used to obtain a computer network system and its component devices;
a graph drawing module, used to draw a directed acyclic graph from the obtained computer network system and its component devices;
a faulty path determination module, used to obtain, from the directed acyclic graph, the faulty path on which the network fault occurs;
a troubleshooting method selection module, used to select a troubleshooting method for the faulty path according to implementation difficulty or contribution degree, the troubleshooting methods being divided by detection direction into bypassing the device under test, the upstream Internet path, and the downstream device-terminal path;
a first fault location module, used to locate the fault on the faulty path using the troubleshooting method;
a judgment module, used to determine whether the fault lies in the device under test;
a first display module, used to display the preset fault solution for the device under test when the device under test is faulty;
a second fault location module, used to locate the fault on the upstream Internet path and the downstream device-terminal path using the troubleshooting method;
a second display module, used to display the preset fault solution for the faulty node on the upstream Internet path or the downstream device-terminal path.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention provides a fault handling method and system based on a directed acyclic graph and expert knowledge. Causal reasoning over the directed acyclic graph is used to troubleshoot and locate problems; expert knowledge is used to supply a device check method at each step of the reasoning and, once the fault is located, a way to fix it. Using the invention is like having a network expert at hand: the user can deal with a fault effectively as soon as it is discovered, minimizing the resulting loss.
Description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a flow chart of the method for locating and troubleshooting network faults based on a directed acyclic graph according to the invention;
Figure 2 is a module diagram of the system for locating and troubleshooting network faults based on a directed acyclic graph according to the invention;
Figure 3 is a schematic structural diagram of the directed acyclic graph of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the invention.
The purpose of the invention is to provide a method and system for locating and troubleshooting network faults based on a directed acyclic graph, which can locate a fault quickly and propose a solution.
To make the above purposes, features, and advantages of the invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Figure 1 is a flow chart of the method of the invention. As shown in Figure 1, a method for locating and troubleshooting network faults based on a directed acyclic graph includes:
Step 101: obtain a computer network system and its component devices;
Step 102: draw a directed acyclic graph from the obtained computer network system and its component devices;
Step 103: obtain, from the directed acyclic graph, the faulty path on which the network fault occurs;
Step 104: select a troubleshooting method for the faulty path according to implementation difficulty or contribution degree, the troubleshooting methods being divided by detection direction into bypassing the device under test, the upstream Internet path, and the downstream device-terminal path;
Step 105: locate the fault on the faulty path using the troubleshooting method;
Step 106: determine whether the fault lies in the device under test;
Step 107: if so, display the preset fault solution for the device under test;
Step 108: if not, use the troubleshooting method to locate the fault on the upstream Internet path and the downstream device-terminal path;
Step 109: continue until the faulty node on the upstream Internet path or the downstream device-terminal path is found, and display the preset fault solution for that node.
The above solution is elaborated in further detail below:
The basic data structure of the invention is a directed acyclic graph.
The structure of the directed graph reflects the causal relationships between nodes. As shown in Figure 3, the problem link is always marked in dark gray. Each node represents a real device; its check items, repair measures, and "contribution degree" are stored as attributes of the node. Causal relationships are determined by the edges and their directions.
The finer-grained structure inside a node is likewise stored as a directed acyclic graph. Expert knowledge is thus fully expressed by directed graphs at multiple "levels". While using the system, the user can reach lower-level knowledge by drilling down repeatedly.
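The data structure described above could be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the class names `Method`, `Node`, and `Graph`, the field names, and the choice to store edges as a child-to-parents mapping are all assumptions made for the example. Only the device names come from Figure 3 as described in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Method:
    """A check or repair action with its two expert-assigned scores (hypothetical type)."""
    description: str
    difficulty: int    # implementation difficulty
    contribution: int  # contribution degree


@dataclass
class Node:
    """One real device; checks and repairs are stored as node attributes."""
    name: str
    checks: list[Method] = field(default_factory=list)
    repairs: list[Method] = field(default_factory=list)
    inner: Optional["Graph"] = None  # finer-grained structure, itself a graph


@dataclass
class Graph:
    """Directed graph stored as child -> list of parents (upstream devices)."""
    nodes: dict[str, Node] = field(default_factory=dict)
    parents: dict[str, list[str]] = field(default_factory=dict)

    def add_edge(self, parent: str, child: str) -> None:
        for name in (parent, child):
            self.nodes.setdefault(name, Node(name))
            self.parents.setdefault(name, [])
        self.parents[child].append(parent)

    def siblings(self, name: str) -> set[str]:
        """Nodes that share a parent with `name`, including `name` itself."""
        shared = set(self.parents[name])
        return {n for n, ps in self.parents.items() if shared & set(ps)}


g = Graph()
g.add_edge("switch_office_1", "pc_1")
g.add_edge("switch_office_1", "mac_imac")
print(sorted(g.siblings("pc_1")))  # ['mac_imac', 'pc_1']
```

Keeping `inner` as an optional nested `Graph` mirrors the multi-level idea: the top-level graph stays small, and detail is only loaded when the user drills into a node.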
The specific process includes:
Step 1: obtain the name and type of the faulty device
Step 2: eliminate the redundant paths of the faulty device
A device may reach the Internet over several paths (for example, "pc_1" in Figure 3 can connect over either a wired or a wireless link). In the directed graph (here, a directed acyclic graph), this appears as a node with multiple parent nodes. In the worst case, the number of possible paths grows exponentially with the number of such nodes on the connection path from the problem terminal to the Internet, and so does the time complexity of fault detection.
In practice, however, the faulty path is usually definite and known. In Figure 3, for example, "pc_1" can reach the switch through either a network cable or a wireless access point, but the user knows which connection was in use when the fault occurred. With a single interaction per such node, the system can eliminate the redundant paths at that node and determine the faulty path.
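The effect of this step can be illustrated with a small sketch. The topology below is hypothetical: only "pc_1" and "switch_office_1" appear in the text, while "cable_1", "ap_1", "router_1", and "internet" are invented for the example, as are the function names. The `choose` callback stands in for the single user interaction at each node that has more than one parent.

```python
# Hypothetical topology, stored as child -> parents (upstream neighbours).
parents = {
    "pc_1": ["cable_1", "ap_1"],
    "cable_1": ["switch_office_1"],
    "ap_1": ["switch_office_1"],
    "switch_office_1": ["router_1"],
    "router_1": ["internet"],
    "internet": [],
}


def all_paths(node, parents):
    """Enumerate every upstream path from `node` to a node with no parents."""
    if not parents[node]:
        return [[node]]
    return [[node] + rest for p in parents[node] for rest in all_paths(p, parents)]


def fault_path(node, parents, choose):
    """Walk upstream once, asking `choose` only where several parents exist."""
    path = [node]
    while parents[node]:
        options = parents[node]
        node = options[0] if len(options) == 1 else choose(node, options)
        path.append(node)
    return path


print(len(all_paths("pc_1", parents)))  # 2 candidate paths before the interaction

# The user reports that the wireless access point was in use.
path = fault_path("pc_1", parents, lambda node, options: "ap_1")
print(path)  # ['pc_1', 'ap_1', 'switch_office_1', 'router_1', 'internet']
```

Enumerating all paths is exponential in the number of multi-parent nodes, whereas the guided walk asks one question per such node and visits each node on the chosen path once.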
Step 3: specific fault detection
To check whether a device is faulty, there are three kinds of inference logic, depending on the detection direction:
Upstream direction ("upstream"; relative to the device under test, the Internet is upstream and terminal devices are downstream; the same applies below): for example, sending ping packets from a DNS server to other servers on the Internet. The verdict is reached by checking whether the device's sibling nodes in the directed graph are also faulty. (Siblings are nodes with the same parent: in Figure 3, "pc_1" and "mac_imac" are siblings because both have the parent "switch_office_1"; by this definition "pc_1" is also its own sibling.) Specifically, if the device cannot connect to the Internet but its siblings can, the fault is in the device. If none of the siblings can connect, the fault is upstream (the parent node or above).
Downstream direction ("downstream"): send a connection request from the problem terminal to the device under test, for example a ping from the problem terminal to a DNS server. If the connection succeeds, the fault is upstream of the DNS server; if not, the fault is downstream of the DNS server.
Bypass ("bypass"): connect the problem terminal to the Internet while bypassing the device under test, for example by connecting via IP address to bypass a DNS server, or by using a network cable to bypass a wireless access point. If the connection succeeds, the fault is in the bypassed device; if not, the problem is upstream or downstream of the device under test. The inference logic above is summarized in Table 1.
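The three inference rules can be written down as small functions. This is a sketch under stated assumptions: the `Verdict` enum and the function names are invented for illustration, the connectivity results are passed in as booleans rather than measured with real pings, and "its siblings can connect" is read as "at least one other sibling can connect", which the text does not specify.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    DEVICE = "fault is in the device under test"
    UPSTREAM = "fault is upstream of the device under test"
    DOWNSTREAM = "fault is downstream of the device under test"
    UPSTREAM_OR_DOWNSTREAM = "fault is upstream or downstream of the device under test"


def infer_upstream(device_ok: bool, other_siblings_ok: list[bool]) -> Optional[Verdict]:
    """Upstream check: the device and its siblings each try to reach the Internet."""
    if device_ok:
        return None  # the text gives no rule for a device that can connect
    if any(other_siblings_ok):
        return Verdict.DEVICE  # assumption: one working sibling is enough
    return Verdict.UPSTREAM  # no sibling can connect


def infer_downstream(terminal_reaches_device: bool) -> Verdict:
    """Downstream check: the problem terminal tries to reach the device under test."""
    return Verdict.UPSTREAM if terminal_reaches_device else Verdict.DOWNSTREAM


def infer_bypass(connected_without_device: bool) -> Verdict:
    """Bypass check: the problem terminal goes online around the device under test."""
    if connected_without_device:
        return Verdict.DEVICE
    return Verdict.UPSTREAM_OR_DOWNSTREAM


print(infer_upstream(False, [True, False]).value)
print(infer_downstream(True).value)
print(infer_bypass(False).value)
```

Separating the verdict from the measurement keeps the rules testable without a network; in a real deployment the booleans would come from whatever connectivity test the operator runs.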
Step 4: drilling into internal structure
A node may have internal structure. Ignoring it makes fault location imprecise, while building the whole graph from the lowest level makes the graph too complex. The invention resolves this dilemma through modularization. After the system has initially located the fault point or fault segment, the user can drill down and analyze the lower-level structure layer by layer. For each layer, this amounts to returning to "obtain the name and type of the faulty device" and repeating steps 2-3 until the specific faulty node is located.
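The drill-down loop might look like the following sketch. The nested-dictionary representation, the `drill_down` function, and the component names "port_3" and "power_supply" are assumptions made for illustration; `locate` stands in for one pass of steps 2-3 on a single layer.

```python
def drill_down(graph, locate):
    """Descend through nested graphs until the located node has no inner structure.

    `graph` maps a node name to its inner graph (same shape) or to None.
    `locate` receives one layer and returns the name of the faulty node in it.
    """
    trail = []
    while graph is not None:
        name = locate(graph)
        trail.append(name)
        graph = graph[name]  # None ends the loop; a dict is the next layer
    return trail


# Hypothetical two-level knowledge base.
network = {
    "pc_1": None,
    "switch_office_1": {"port_3": None, "power_supply": None},
}

# Scripted answers replace the interactive diagnosis in this example.
answers = iter(["switch_office_1", "port_3"])
trail = drill_down(network, lambda layer: next(answers))
print(trail)  # ['switch_office_1', 'port_3']
```

Because each layer is diagnosed with the same procedure, the per-layer graphs can stay small while the overall knowledge base remains as detailed as needed.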
Step 5: storing, displaying, and updating statistics for diagnosis and repair methods
Each check method and repair method carries a quantitative description consisting of two values, "contribution degree" and "implementation difficulty". Both values are initialized by experts. Implementation difficulty stays fixed after initialization; the contribution value is updated automatically after each diagnosis.
One embodiment is as follows:
The check has currently reached "Huawei switch 1". Three check methods, each with its "implementation difficulty" and "contribution degree", were entered for this device node in the knowledge base in advance:
1) "Bypass the switch and connect directly to the upstream router": implementation difficulty 30, contribution 2;
2) "Change to another port": implementation difficulty 10, contribution 50;
3) "Replace the network cable connected to the switch": implementation difficulty 30, contribution 10.
The system lists its check suggestions in the order 2), 3), 1): the easiest check comes first, and where difficulty is equal (as with 1) and 3)), the check with the larger contribution comes first.
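The ordering rule in this embodiment amounts to a two-key sort: ascending implementation difficulty, then descending contribution. A minimal sketch using the three checks from the example (the tuple layout and variable names are assumptions):

```python
# (label, description, implementation difficulty, contribution degree)
methods = [
    ("1)", "bypass the switch and connect directly to the upstream router", 30, 2),
    ("2)", "change to another port", 10, 50),
    ("3)", "replace the network cable connected to the switch", 30, 10),
]

# Easiest first; among equally difficult checks, larger contribution first.
ranked = sorted(methods, key=lambda m: (m[2], -m[3]))

order = [label for label, _, _, _ in ranked]
print(order)  # ['2)', '3)', '1)']
```

Negating the contribution in the sort key lets a single ascending sort express both criteria.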
During incident handling, if the answer option tied to a check method is selected (so that the method has contributed to locating the problem), the "contribution degree" of that check method is increased by 1. The contribution values of all check methods are then updated together according to the rule in formula (1): the largest contribution value is normalized to 100, and the other values are scaled by the same ratio and truncated to integers.
Formula (1):
contribution of this method = int(contribution of this method / maximum contribution among all check methods × 100)
Similarly, if a repair method fixes the fault, the "contribution degree" of that repair method is increased by 1, and the values for all repair methods are then updated together according to formula (2).
Formula (2) (the largest contribution value is normalized to 100, and the other values are scaled by the same ratio and truncated to integers):
contribution of this method = int(contribution of this method / maximum contribution among all repair methods × 100)
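The update rule can be tried on the three check methods from the embodiment. This is only a sketch: the function name `record_success` and the dictionary keys are invented, and integer floor division is used in place of `int(a / b * 100)`; the two agree for non-negative integers in exact arithmetic, and the integer form avoids floating-point rounding surprises.

```python
def record_success(contributions, method):
    """Add 1 to the successful method, then rescale so the maximum becomes 100."""
    updated = dict(contributions)  # leave the caller's data untouched
    updated[method] += 1
    top = max(updated.values())  # at least 1 after the increment
    return {name: value * 100 // top for name, value in updated.items()}


# Contribution values of the three checks in the embodiment (keys are hypothetical).
checks = {"bypass_switch": 2, "change_port": 50, "replace_cable": 10}

# Suppose replacing the cable is what located the problem this time.
result = record_success(checks, "replace_cable")
print(result)  # {'bypass_switch': 4, 'change_port': 100, 'replace_cable': 22}
```

The same function applies unchanged to repair methods, since formulas (1) and (2) differ only in which set of methods the maximum is taken over.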
The "implementation difficulty" and "contribution degree" values are taken into account during both fault detection and fault repair: from the entered name and type of the faulty device, the database matches the corresponding check methods and repair methods; both sets are ranked by "implementation difficulty" and "contribution degree", and only then is the method to use determined.
A specific diagnostic process is described below:
1. Determine the problem domain and the related device. The user can select it by clicking, or enter the problem in natural language and let the system's matching function identify it. For example, the current problem device is pc_1 (the device's name in the system).
2. Remove redundant paths through interaction.
3. Enter the diagnosis flow. The system sorts the check methods by their "implementation difficulty" values and lists all check suggestions for the current step in the "Troubleshooting" panel.
4. The user carries out the suggested check and reports the result by clicking an answer to the question; the system reasons from the answer.
5. During diagnosis, the system responds to the user's feedback either with a conclusion or with further questions that progressively narrow the scope of the investigation.
Repeat steps 3-5 until the problem is located. The system then ranks the repair methods by a combination of contribution degree and implementation difficulty and presents its suggestions. If the located device has internal structure, the user can continue to drill down.
Depending on how the incident ended (the fault was resolved by the method the system suggested, the fault was resolved by the user's own method, or the fault was not resolved), the system automatically generates a report. The user edits the report as needed and submits it; the submission updates the system's statistics and is kept as a historical record.
Corresponding to the above method, the present invention also discloses a network fault locating and troubleshooting system based on a directed acyclic graph, as shown in Figure 2, comprising:
a device acquisition module 201, configured to acquire the computer network system and its component devices;
a graph drawing module 202, configured to draw a directed acyclic graph from the acquired computer network system and its component devices;
a fault loop determination module 203, configured to obtain, from the directed acyclic graph, the fault loop in which the network fault occurs;
a troubleshooting method selection module 204, configured to obtain a troubleshooting method for the fault loop according to implementation difficulty or contribution, the troubleshooting methods being divided by detection direction into bypassing the inspected device, checking the upstream-direction Internet loop, and checking the downstream-direction device terminal loop;
a first fault locating module 205, configured to locate the fault in the fault loop using the troubleshooting method;
a judgment module 206, configured to determine whether the fault lies in the inspected device;
a first display module 207, configured to display the preset fault solution for the inspected device when the inspected device is faulty;
a second fault locating module 208, configured to use the troubleshooting method to locate the fault in the upstream-direction Internet loop and the downstream-direction device terminal loop;
a second display module 209, configured to display the preset fault solutions for the faulty nodes in the upstream-direction Internet loop and the downstream-direction device terminal loop.
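The flow through modules 201-205 can be sketched roughly as below, under stated assumptions. The device names, the adjacency-dict representation of the directed acyclic graph, and the functions `find_fault_loop` and `locate_fault` are hypothetical. The `reaches_internet` callback stands in for whatever check or user answer tells the system whether the fault is upstream or downstream of the inspected device, and the binary search is an illustrative way of narrowing the scope, not necessarily the procedure the patent uses.

```python
def find_fault_loop(dag, source, terminal):
    """Depth-first search for one path from source to terminal in the DAG."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == terminal:
            return path
        for downstream in dag.get(node, []):
            stack.append((downstream, path + [downstream]))
    return None


def locate_fault(path, reaches_internet):
    """Return the first device on the path that cannot reach the Internet.

    Assumes every device upstream of the fault passes the check and the
    faulty device and everything downstream of it fail the check.
    """
    lo, hi = 0, len(path) - 1
    if reaches_internet(path[hi]):
        return None  # the terminal itself is fine, so there is no fault on this loop
    while lo < hi:
        mid = (lo + hi) // 2
        if reaches_internet(path[mid]):
            lo = mid + 1  # inspected device is fine: continue downstream
        else:
            hi = mid      # fault is at the inspected device or further upstream
    return path[lo]


# Edges point from the Internet side toward the terminals (modules 201 and 202).
dag = {
    "internet": ["modem"],
    "modem": ["router"],
    "router": ["switch", "ap"],
    "switch": ["pc"],
    "ap": ["laptop"],
}
loop = find_fault_loop(dag, "internet", "pc")  # module 203
broken = "router"                              # simulated fault for the example
fault = locate_fault(loop, lambda d: loop.index(d) < loop.index(broken))  # module 205
```

Each probe halves the remaining section of the loop, which is one way to realize the "continuously narrowing" behaviour described in the method.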
Beneficial effects of the technical solution of the present invention:
Based on the characteristics of computer networks, the present invention designs a data structure suited to storing expert knowledge, classifies the detection methods, and builds its reasoning logic on that basis. The system can therefore provide expert advice on locating and repairing faults stably and efficiently, reducing the losses caused by network interruptions. Expert knowledge is recorded in the system in an understandable, comprehensive, and quantifiable form and is shared with users along with the system. In addition, operation depends less on any single expert, which lowers single-point risk. The system also records and tallies events automatically, which makes it easier to assess the sources of instability quantitatively and then make targeted improvements.
The embodiments in this specification are described progressively: each embodiment focuses on how it differs from the others, and the parts that are the same or similar can be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described only briefly; for related details, refer to the description of the method.
Specific examples are used herein to explain the principles and implementation of the present invention. The description of the above embodiments is intended only to aid understanding of the method and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, vary the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011124262.0A CN112231523B (en) | 2020-10-20 | 2020-10-20 | Network fault positioning and troubleshooting method and system based on directed acyclic graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112231523A CN112231523A (en) | 2021-01-15 |
CN112231523B CN112231523B (en) | 2024-01-16 |
Family
ID=74118150
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011124262.0A Active CN112231523B (en) | 2020-10-20 | 2020-10-20 | Network fault positioning and troubleshooting method and system based on directed acyclic graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231523B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116737511A (en) * | 2023-08-10 | 2023-09-12 | 山景智能(北京)科技有限公司 | Graph-based scheduling job monitoring method and device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1998022828A1 (en) * | 1996-11-19 | 1998-05-28 | Anpico | Device and method for analysing faults on networks |
CA2353778A1 (en) * | 2000-07-25 | 2002-01-25 | Symmetricom, Inc. | Subscriber loop repeater loopback for fault isolation |
CN101106424A (en) * | 2006-07-11 | 2008-01-16 | 阿尔卡特朗讯公司 | Method and device for monitoring optical connection channels of a transmission optical network |
CN102158360A (en) * | 2011-04-01 | 2011-08-17 | 华中科技大学 | Network fault self-diagnosis method based on causal relationship positioning of time factors |
US8948596B2 (en) * | 2011-07-01 | 2015-02-03 | CetusView Technologies, LLC | Neighborhood node mapping methods and apparatus for ingress mitigation in cable communication systems |
CN106992877A (en) * | 2017-03-08 | 2017-07-28 | 中国人民解放军国防科学技术大学 | Network Fault Detection and restorative procedure based on SDN frameworks |
CN107846330A (en) * | 2017-12-18 | 2018-03-27 | 深圳创维数字技术有限公司 | A kind of network fault detecting method, terminal and computer-readable medium |
CN109039763A (en) * | 2018-08-28 | 2018-12-18 | 曙光信息产业(北京)有限公司 | A kind of network failure nodal test method and Network Management System based on backtracking method |
CN109460833A (en) * | 2018-10-22 | 2019-03-12 | 国家电网有限公司 | Method and system for processing equipment data and repair work order data of distribution network faults |
CN111786364A (en) * | 2020-06-02 | 2020-10-16 | 国电南瑞科技股份有限公司 | Distributed complex power distribution network fault rapid self-healing control method and system |
Non-Patent Citations (3)
Title |
---|
automatic model-based fault detection and diagnosis using diagnostic directed acyclic graph for a demand-controlled ventilation and heating system in simulink;Ali Behravan等;2018 Annual IEEE International systems conference;1-7 * |
基于SOP的主动式谐振接地配电网单相接地故障区段定位方法;叶雨晴;马啸;林湘宁;李正天;许烽;王朝亮;倪晓军;丁超;;中国电机工程学报;第40卷(第05期);1453-1465 * |
线路保护光纤通道告警分析处理与改进设想;王荣超;;电工电气(第11期);44-48+67 * |
Also Published As
Publication number | Publication date |
---|---|
CN112231523A (en) | 2021-01-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109787817B (en) | Network fault diagnosis method, device and computer-readable storage medium | |
US10389596B2 (en) | Discovering application topologies | |
US12058015B2 (en) | Systems and methods for an interactive network analysis platform | |
CN110493025B (en) | A method and device for fault root cause diagnosis based on multi-layer directed graph | |
CN106992877B (en) | Network Fault Detection and restorative procedure based on SDN framework | |
CN102611568B (en) | A kind of failure service path diagnostic method and device | |
CN109150635B (en) | Fault influence analysis method and device | |
CN102158360A (en) | Network fault self-diagnosis method based on causal relationship positioning of time factors | |
CN104796273A (en) | Method and device for diagnosing root of network faults | |
CN105740133B (en) | A kind of Distributed Application method for monitoring performance based on service call topology | |
CN117041029A (en) | Network equipment fault processing method and device, electronic equipment and storage medium | |
CN112231523B (en) | Network fault positioning and troubleshooting method and system based on directed acyclic graph | |
CN118740678A (en) | Fault detection method and device for network equipment and electronic equipment | |
US20220360509A1 (en) | Network adaptive monitoring | |
US20230198866A1 (en) | Triggered automation framework | |
Hata et al. | Alarm correlation method using bayesian network in telecommunications networks | |
CN117294587A (en) | Network configuration fault analysis method, server and medium | |
US11102103B2 (en) | Network stabilizing tool | |
Wang et al. | NetAssistant: Dialogue based network diagnosis in data center networks | |
CN114666373A (en) | Internet of things terminal maintenance method and related equipment | |
Kannan et al. | A differential approach for configuration fault localization in cloud environments | |
CN115544202A (en) | Alarm processing method, device and storage medium | |
US20230336400A1 (en) | Network intent cluster | |
CN115865621B (en) | Network fault diagnosis method, device, equipment and storage medium | |
Wu et al. | Resilience in Industrial Internet of Things Systems: A Communication Perspective |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20250408; Address after: Room A707C-48, Runxiang Business Center, Zhe Lu Street, Jinghu District, Wuhu City, Anhui Province 241000; Patentee after: Wuhu Manxiu Technology Co.,Ltd.; Country or region after: China; Address before: 510000 room 504, block a, Hengda business center, 3 nangui Road, Luopu street, Panyu District, Guangzhou City, Guangdong Province; Patentee before: Guangzhou Zhitu Technology Co.,Ltd.; Country or region before: China |