CN115733724B

CN115733724B - Method, device, electronic device and storage medium for locating root cause of business failure

Info

Publication number: CN115733724B
Application number: CN202110996024.7A
Authority: CN
Inventors: 张晓民; 尧平; 陈乐�; 丁泽伟; 罗朝彤; 薛蓉蓉; 郑伟; 朱国忠; 邓忻; 刘聪; 徐思敏
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2024-11-15
Anticipated expiration: 2041-08-27
Also published as: CN115733724A

Abstract

The present invention provides a method, device, electronic device and storage medium for locating the root cause of a business failure, the method comprising: determining the moment when an abnormality exists in a target indicator of a target business as the abnormal moment; confirming each root cause node on the call chain based on information of a call chain of the target business in a target time period with the abnormal moment as the end moment; determining the root cause alarm in the alarm corresponding to each root cause node based on the alarm corresponding to each root cause node and the topological relationship between each device. The method, device, electronic device and storage medium for locating the root cause of a business failure provided by the present invention can realize fast and accurate root cause location by determining the moment when an abnormality exists in a target indicator of a target business as the abnormal moment, confirming each root cause node based on information of a call chain of the target business in a target time period before the abnormal moment, and determining the root cause alarm based on the alarm corresponding to each root cause node and the topological relationship between each device.

Description

Service fault root cause positioning method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of communications networks, and in particular, to a method and apparatus for locating a root cause of a service fault, an electronic device, and a storage medium.

Background

With the development of communication services, conventional service systems are gradually being decoupled, and a complete service system comprises a plurality of sub-service systems. Given the complex topological and calling relationships within a service system, locating the root cause of a service system fault has become a technical problem in urgent need of a solution.

Currently, professional operation and maintenance staff usually locate the root cause of a service fault manually, relying on expert experience. This is labor-intensive, puts heavy pressure on maintenance, and requires the staff to accumulate sufficient experience with the service system. The prior art therefore cannot quickly and accurately locate the core root cause node or discover the core root cause alarm.

Disclosure of Invention

The invention provides a service fault root cause positioning method and device, electronic equipment, and a storage medium, which address the defect that the prior art cannot locate root causes rapidly and accurately, and which realize rapid and accurate root cause positioning.

In a first aspect, the present invention provides a method for locating a root cause of a service fault, including:

determining the moment at which a target index of a target service is determined to be abnormal as the abnormal moment;

determining each root cause node on the call chain based on information of the call chain of the target service in a target time period that takes the abnormal moment as its end moment;

and determining the root cause alarm among the alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices.

In one embodiment, the target index includes: average delay and success rate.

In one embodiment, the determining, based on the information of the call chain of the target service in the target time period taking the abnormal time as the ending time, each root cause node on the call chain specifically includes:

Determining each abnormal node on a call chain based on the information of the call chain of the target service in the target time period;

and determining the root cause nodes in the abnormal nodes based on the information of the abnormal nodes.

In one embodiment, the determining the root cause alarm in the alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation between the devices specifically includes:

determining, based on a preset association rule, each target alarm pair and the association relation score between the alarms corresponding to the two root cause nodes included in each target alarm pair;

and determining root cause alarms in alarms corresponding to each root cause node based on the target alarm pairs and the association relation scores of the alarms corresponding to the two root cause nodes included in each target alarm pair.

In one embodiment, the determining the root cause node in the abnormal nodes based on the information of the abnormal nodes specifically includes:

determining possible root cause nodes among the abnormal nodes based on the depth and the failure rate of each abnormal node in the call chain;

determining the root cause nodes among the abnormal nodes based on information of the child nodes of each possible root cause node.

In one embodiment, the determining the root cause alarm in the alarms corresponding to the root cause nodes based on the target alarm pairs and the association scores between the two alarms corresponding to the root cause nodes included in each target alarm pair specifically includes:

Generating an undirected graph based on the association relation scores between the alarms corresponding to the root cause nodes included in each target alarm pair;

And determining root cause alarms in alarms corresponding to the root cause nodes based on the maximum spanning tree of the undirected graph.

In one embodiment, the generating an undirected graph based on the target alarm pairs and the association relation scores between the alarms corresponding to the two root cause nodes included in each target alarm pair specifically includes:

dividing the alarms corresponding to the root cause nodes into a plurality of target events based on the alarm time of the alarms corresponding to the root cause nodes;

generating an undirected graph for each target event separately.

In a second aspect, the present invention provides a service fault root cause positioning device, including:

The index detection module is used for determining the moment at which a target index of a target service is determined to be abnormal as the abnormal moment;

The node positioning module is used for determining each root cause node on the call chain based on the information of the call chain of the target service in the target time period taking the abnormal moment as the end moment;

And the alarm positioning module is used for determining root cause alarms in alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices.

In a third aspect, the present invention provides an electronic device, including a processor and a memory storing a computer program, where the processor implements the steps of any one of the service fault root cause locating methods described above when executing the computer program.

In a fourth aspect, the present invention provides a processor readable storage medium storing a computer program for causing the processor to execute the steps of any one of the above-mentioned service fault root cause localization methods.

According to the service fault root cause positioning method, device, electronic equipment and storage medium provided by the invention, the moment at which the target index of the target service is determined to be abnormal is taken as the abnormal moment; each root cause node on the call chain is determined based on the information of the call chain of the target service in the target time period that takes the abnormal moment as its end moment; and the root cause alarm among the alarms corresponding to the root cause nodes is determined based on those alarms and the topological relation among the devices, so that quick and accurate root cause positioning can be achieved. Further, the application architecture in a service root cause positioning scenario is chained and has a clear hierarchical layout. Decoupling the service scenario yields three finer-grained, independent scenarios, which helps to define the actual service of each scenario and reduces the complexity of the algorithm for each scenario, making root cause positioning quicker and more accurate. This avoids the defect of traditional methods, in which the root cause positioning logic is not hierarchically decoupled and quick positioning is difficult to achieve. It also enables effective monitoring of the service system, integrates operation management with operation and maintenance assurance, and allows positioning based on the physical network element structure.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for locating a root cause of a service fault provided by the invention;

FIG. 2 is a schematic diagram of a call chain in a service fault root cause positioning method provided by the invention;

FIG. 3 is a schematic diagram of a generation process of a maximum spanning tree in a service fault root cause positioning method provided by the invention;

FIG. 4 is a schematic diagram of the alarm time in the service fault root cause positioning method provided by the invention;

FIG. 5 is a second schematic diagram of the alarm time in the service fault root cause positioning method according to the present invention;

FIG. 6 is a schematic diagram of a service fault root cause positioning device according to the present invention;

FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The service fault root cause positioning method, device, electronic equipment and storage medium of the invention are described below with reference to fig. 1-6.

Fig. 1 is a flow chart of a service fault root cause positioning method provided by the application. The following describes a service fault root cause positioning method provided by the embodiment of the application with reference to fig. 1. As shown in fig. 1, the method includes: Step 101, determining the moment at which the target index of the target service is determined to be abnormal as the abnormal moment.

Specifically, the execution main body of the service fault root cause positioning method provided by the embodiment of the invention is the service fault root cause positioning device provided by the invention.

The target service may be a certain service carried by the service system.

For a target service, the target index of the service can be detected, and whether the target index of the service is normal or not is judged.

For example, the target index of the target service may be periodically detected, and whether the target index of the target service in the present period is normal may be determined.

For example, the target index of the service in the second target period may be detected, and whether the target index of the service in the second target period is normal may be determined.

The duration of the second target time period may be set according to practical situations, and the embodiment of the present invention is not specifically limited for the specific duration of the second target time period.

The target index is an index for representing whether the service is normal or not. Illustratively, the target metrics may include average delay or success rate, etc. The embodiment of the present invention is not particularly limited with respect to a specific target index. The target index may be colloquially referred to as a "golden index".

Under the condition that the target index in the second target time period of the target service falls into the normal range, the target index of the target service can be determined to be normal without abnormality; and under the condition that the target index in the second target time period of the target service does not fall into the normal range, determining that the target index of the target service is abnormal.

After the target index of the target service is determined to be abnormal, the time when the target index of the target service is determined to be abnormal can be taken as the abnormal time of the target service.

Step 102, determining each root cause node on the call chain based on the information of the call chain of the target service in the target time period that takes the abnormal moment as its end moment.

Specifically, call chaining is a commonly used way of running services: a series of service nodes is arranged in a certain order as a call chain, so as to provide various services to the user.

As shown in FIG. 2, each call chain is composed of a plurality of service nodes (simply referred to as "nodes"). A complete call chain is essentially a unidirectional chain of calls, and each call chain has a unique identifier, traceId, that distinguishes it from other call chains. From the calling logic of the call chain, the time consumed by each call and whether the call succeeded can be obtained; both reflect node fault information.

Call chain information is produced as follows: while the service system completes one service call, the call information between service nodes (time, interface, hierarchy and result) is recorded in logs at instrumentation points, and all of the recorded data is then connected into a tree-shaped chain, which yields the call chain data.

The target service may have one or more call chains.
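As an illustration of how logged call records might be connected into a tree-shaped call chain, the following is a minimal sketch. Only traceId comes from the description; the field names spanId, parentId and name, and the function build_call_chains, are hypothetical and chosen for the example.

```python
from collections import defaultdict


def build_call_chains(spans):
    """Group logged call records by traceId and link each record to its parent.

    Returns a dict mapping traceId -> head node of that call chain.
    """
    by_trace = defaultdict(dict)
    for span in spans:
        # Copy the record and give it an empty list of child calls.
        by_trace[span["traceId"]][span["spanId"]] = dict(span, children=[])

    heads = {}
    for trace_id, nodes in by_trace.items():
        for node in nodes.values():
            parent = nodes.get(node["parentId"])
            if parent is None:
                heads[trace_id] = node  # no parent: this is the head node
            else:
                parent["children"].append(node)
    return heads


# Hypothetical log records for two call chains.
spans = [
    {"traceId": "t1", "spanId": "a", "parentId": None, "name": "osb_001"},
    {"traceId": "t1", "spanId": "b", "parentId": "a", "name": "csf_001"},
    {"traceId": "t2", "spanId": "a", "parentId": None, "name": "osb_001"},
]
heads = build_call_chains(spans)
```

Each head node can then be walked through its children lists, from the entry node down to the bottom-layer nodes.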

In the embodiment of the invention, the information of the call chain of the target service in the target time period (the target time period is the first target time period) is acquired. The duration of the first target period is preset, and the end time of the first target period is the abnormal time determined in step 101.

The duration of the first target time period may be set according to actual situations, and the embodiment of the present invention is not specifically limited for the specific duration of the first target time period.

Optionally, the duration of the first target period is 5 minutes.
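Selecting the call chains that fall inside the first target time period could look like the sketch below. It assumes each call chain record carries a single timestamp under a hypothetical key "time"; the 5-minute default follows the optional duration mentioned above.

```python
from datetime import datetime, timedelta


def chains_in_target_period(chains, abnormal_time, duration=timedelta(minutes=5)):
    """Keep the call chains whose timestamp lies in [abnormal_time - duration, abnormal_time]."""
    start = abnormal_time - duration
    return [c for c in chains if start <= c["time"] <= abnormal_time]


# Hypothetical data: the abnormal moment and four call chains around it.
abnormal_time = datetime(2021, 8, 27, 12, 0, 0)
chains = [
    {"traceId": "t1", "time": datetime(2021, 8, 27, 11, 54, 0)},   # before the period
    {"traceId": "t2", "time": datetime(2021, 8, 27, 11, 57, 30)},
    {"traceId": "t3", "time": datetime(2021, 8, 27, 12, 0, 0)},
    {"traceId": "t4", "time": datetime(2021, 8, 27, 12, 1, 0)},    # after the abnormal moment
]
selected = chains_in_target_period(chains, abnormal_time)
```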

In a call chain, if a bottom-layer node takes longer or its call fails, the upper-layer nodes take correspondingly longer or fail as well. Root causes therefore propagate from bottom to top, and the closer a node is to the bottom layer, the more likely it is to be a root cause node. Accordingly, each root cause node can be determined by analyzing the call time consumption of each node on each call chain.

Step 103, determining root cause alarms in alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices.

Specifically, each root cause node located in step 102 for a given abnormal moment is associated with the topological relation of the platform architecture of the service system, the alarms corresponding to each root cause node are located, the relationships between those alarms are analyzed, and the root cause alarm is finally located.

The alarms corresponding to the root cause node refer to alarms associated with the root cause node.
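One simple way to associate root cause nodes with alarms through the topology is sketched below. It assumes the topological relation is available as a mapping from node name to device, and that each alarm records the device it was raised on; the names node_to_device, "device" and "type" are hypothetical.

```python
def alarms_for_root_cause_nodes(root_cause_nodes, node_to_device, alarms):
    """Return the alarms raised on devices that host any of the root cause nodes."""
    devices = {node_to_device[n] for n in root_cause_nodes if n in node_to_device}
    return [a for a in alarms if a["device"] in devices]


# Hypothetical topology and alarm data.
node_to_device = {"docker_006": "host-12", "db_003": "host-31"}
alarms = [
    {"id": 1, "device": "host-12", "type": "cpu_high"},
    {"id": 2, "device": "host-31", "type": "disk_full"},
    {"id": 3, "device": "host-12", "type": "container_restart"},
]
related = alarms_for_root_cause_nodes(["docker_006"], node_to_device, alarms)
```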

According to the embodiment of the invention, the moment at which the target index of the target service is determined to be abnormal is taken as the abnormal moment; each root cause node on the call chain is determined based on the information of the call chain of the target service in the target time period that takes the abnormal moment as its end moment; and the root cause alarm among the alarms corresponding to the root cause nodes is determined based on those alarms and the topological relation among the devices, so that quick and accurate root cause positioning can be realized. Further, the application architecture in a service root cause positioning scenario is chained and has a clear hierarchical layout. Decoupling the service scenario yields three finer-grained, independent scenarios, which helps to define the actual service of each scenario and reduces the complexity of the algorithm for each scenario, making root cause positioning quicker and more accurate. This avoids the defect of traditional methods, in which the root cause positioning logic is not hierarchically decoupled and quick positioning is difficult to achieve. It also enables effective monitoring of the service system, integrates operation management with operation and maintenance assurance, and allows positioning based on the physical network element structure.

Based on the foregoing any one of the embodiments, the target index includes: average delay and success rate.

Specifically, the target index may include an average delay and a success rate, and whether the target service is abnormal is determined by combining the average delay and the success rate.

Whether the average delay and the success rate are abnormal or not can be detected by an absolute threshold detection method. The specific logic for judging whether the abnormality exists is as follows:

a. If average delay >= average delay threshold and success rate >= success rate threshold, return the time-consumption-abnormal fault type;

b. If average delay < average delay threshold and 0 < success rate < success rate threshold, return the success-rate-abnormal fault type;

c. If average delay < average delay threshold and success rate = 0, return the database-abnormal fault type;

d. If average delay >= average delay threshold and 0 < success rate < success rate threshold, return the fault type in which time consumption and success rate are both abnormal;

e. If average delay >= average delay threshold and success rate = 0, return the fault type in which time consumption and the database are both abnormal.

The average delay threshold avgTimeThreshold may be set according to the actual situation, for example, the average delay threshold is 1s. The embodiment of the present invention is not particularly limited with respect to the specific value of the average delay threshold.

The success rate threshold successRate may be set according to the actual situation, for example, the success rate threshold is 0.9. The embodiment of the present invention is not particularly limited with respect to the specific value of the success rate threshold.
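The absolute-threshold classification can be sketched as a small function. This is an illustration under stated assumptions rather than the patented implementation: a success rate of exactly 0 is read as the database-abnormal case (consistent with case e), a success rate strictly between 0 and the threshold as the success-rate-abnormal case, and the function name and returned labels are hypothetical. The default thresholds are the example values of 1 s and 0.9.

```python
from typing import Optional


def classify_fault(avg_delay: float, success_rate: float,
                   avg_time_threshold: float = 1.0,
                   success_rate_threshold: float = 0.9) -> Optional[str]:
    """Return a fault-type label, or None when both indicators are in range."""
    slow = avg_delay >= avg_time_threshold
    if success_rate == 0:
        # Cases c and e: nothing succeeds, read here as a database fault.
        return "time_and_database" if slow else "database"
    if success_rate < success_rate_threshold:
        # Cases b and d: some calls succeed, but too few.
        return "time_and_success_rate" if slow else "success_rate"
    # Case a, or no abnormality at all.
    return "time" if slow else None
```

For instance, classify_fault(2.0, 0.5) falls under case d, while classify_fault(0.2, 0.99) reports no abnormality.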

Unlike complex anomaly detection algorithms, the embodiment of the invention analyzes the target index of the service using simple absolute thresholds and classification. The method is quick, effective and easy to adjust, and the detailed classification reflects the real service problems under different conditions.

Based on the content of any one of the foregoing embodiments, determining each root cause node on the call chain based on the information of the call chain of the target service in the target time period that takes the abnormal moment as its end moment specifically includes: determining each abnormal node on the call chain based on the information of the call chain of the target service in the target time period.

Specifically, for each call chain, whether it is a call chain with abnormal time consumption is determined based on the time consumed by that call chain.

The time consumption of call chains follows a normal distribution, so a preset quantile can be used to approximate a confidence interval and screen out the call chains with abnormal time consumption.

Preferably, the preset quantile may be the 95th percentile, which approximates the data interval containing the top 5% of time consumption. If the time consumption of a call chain falls within this 5% interval, the call chain is an abnormal call chain.

Alternatively, the preset quantile may be the 90th percentile, which approximates the data interval containing the top 10% of time consumption. If the time consumption of a call chain falls within this 10% interval, the call chain is an abnormal call chain.
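The quantile screening can be sketched with the Python standard library. The function name and the input layout (a dict from traceId to time consumption) are assumptions for illustration; the percentile cut point is computed with statistics.quantiles, and a call chain is treated as abnormal when its time consumption lies above that cut point.

```python
import statistics


def abnormal_call_chains(durations, percentile=95):
    """Return the traceIds whose time consumption exceeds the given percentile.

    durations maps traceId -> time consumption; at least two entries are needed.
    """
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cut_points = statistics.quantiles(list(durations.values()), n=100)
    threshold = cut_points[percentile - 1]
    return [trace_id for trace_id, d in durations.items() if d > threshold]


# Hypothetical data: 100 call chains taking 1 s, 2 s, ..., 100 s.
durations = {f"t{i}": float(i) for i in range(1, 101)}
slow_95 = abnormal_call_chains(durations)
slow_90 = abnormal_call_chains(durations, percentile=90)
```

Passing percentile=90 gives the looser 10% screening described as the alternative.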

The nodes on each call chain with abnormal time consumption can be traversed by searching, so as to determine the abnormal nodes.

The call time consumption of nodes of the same type on call chains is likewise treated as following a normal distribution.

If a node's time consumption meets the abnormal threshold for nodes of its type, or the node's call fails, the node is an abnormal node. If the node's time consumption does not meet the abnormal threshold for nodes of its type and the node's call succeeds, but the time consumption of the node's parent is more than twice the sum of the time consumption of all nodes on the node's layer, the node is a suspected abnormal node.

Taking the partial call chain in fig. 2 as an example, the search process for abnormal nodes on the call chain is as follows:

Starting from the head node, osb_001 is an abnormal node, so the node information {'osb_001': [1, 1, 0]} is recorded (format: node name: [number of abnormal occurrences, depth, status], where the status is 0 for abnormal, 1 for suspected abnormal, and 2 for normal);

Continue searching downward to reach the csf_001 node;

If the time consumption of the csf_001 node meets the abnormal threshold of csf_001, record the node information {'osb_001': [1, 1, 0], 'csf_001': [1, 2, 0]};

Since csf_001 is known to be an abnormal node, continue searching downward to find the csf_003 node;

If the time consumption of the csf_003 node does not meet the abnormal threshold of csf_003 but its call failed, record the node information {'osb_001': [1, 1, 0], 'csf_001': [1, 2, 0], 'csf_003': [1, 3, 0]};

Since csf_003 is known to be an abnormal node, continue searching downward to find the local_15 node and the local_16 node;

If the time consumption of the local_15 node does not meet the abnormal threshold of local_15 but its call failed, record the node information {'osb_001': [1, 1, 0], 'csf_001': [1, 2, 0], 'csf_003': [1, 3, 0], 'local_15': [1, 4, 0]};

The time consumption of the local_16 node does not meet the abnormal threshold of local_16 and its call succeeded, so local_16 is not an abnormal node. However, the time consumption of csf_003 is 20 s, while its self-call, local_15 and local_16 take 4 s, 3 s and 2 s respectively, so the time consumption of csf_003 is more than twice the sum of the time consumption of its child calls. local_16 is therefore a suspected abnormal node, and the node information {'osb_001': [1, 1, 0], 'csf_001': [1, 2, 0], 'csf_003': [1, 3, 0], 'local_15': [1, 4, 0], 'local_16': [1, 4, 1]} is recorded;

Since local_15 is known to be an abnormal node, continue searching downward to find the db_003 node of the jdbc layer;

The time consumption of the db_003 node does not meet the abnormal threshold of db_003 and its call succeeded, so db_003 is not an abnormal node. However, the time consumption of local_15 is 3 s, while its child node db_003 takes 1 s, so the time consumption of local_15 is more than twice the time consumption of its child call. db_003 is therefore a suspected abnormal node, and the node information {'osb_001': [1, 1, 0], 'csf_001': [1, 2, 0], 'csf_003': [1, 3, 0], 'local_15': [1, 4, 0], 'local_16': [1, 4, 1], 'db_003': [1, 5, 1]} is recorded;

Since local_16 is known to be a suspected abnormal node, stop searching below it;

When a global search is needed, local_16 is known to be a suspected abnormal node and has child nodes;

In that case the search continues, and the subsequent child nodes are judged by call status only, ignoring the influence of delay. For example, db_003 is not recorded as abnormal under the call from local_16, so the recorded node information {'osb_001': [1, 1, 0], 'csf_001': [1, 2, 0], 'csf_003': [1, 3, 0], 'local_15': [1, 4, 0], 'local_16': [1, 4, 1], 'db_003': [1, 5, 1]} remains consistent with that recorded under the call from local_15. The process is repeated until the whole search is complete.

Through the above procedure, each abnormal node can be determined.
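The search walked through above can be sketched as a recursive traversal. This is a minimal illustration with pruning only (the search does not descend below suspected abnormal nodes, and the global-search variant is not covered). Assumptions: "meets the abnormal threshold" is read as duration >= threshold, the sum over a node's layer is taken as the parent's self-call time plus the time of all of the parent's children, and every threshold value and the head node's duration are hypothetical. The helper names make_node, node_status and search are invented for the example.

```python
ABNORMAL, SUSPECTED, NORMAL = 0, 1, 2


def make_node(name, duration, threshold, failed=False, self_time=0, children=()):
    return {"name": name, "duration": duration, "threshold": threshold,
            "failed": failed, "self_time": self_time, "children": list(children)}


def node_status(node, parent):
    """Classify one node as abnormal, suspected abnormal, or normal."""
    if node["failed"] or node["duration"] >= node["threshold"]:
        return ABNORMAL
    if parent is not None:
        layer_sum = parent["self_time"] + sum(c["duration"] for c in parent["children"])
        if parent["duration"] > 2 * layer_sum:
            return SUSPECTED
    return NORMAL


def search(node, records, depth=1, parent=None):
    """Record [abnormal count, depth, status] per node name, pruning below suspected nodes."""
    status = node_status(node, parent)
    if status == NORMAL:
        return records
    count, _, worst = records.get(node["name"], [0, depth, status])
    records[node["name"]] = [count + 1, depth, min(worst, status)]
    if status == ABNORMAL:  # pruning: do not descend below suspected nodes
        for child in node["children"]:
            search(child, records, depth + 1, node)
    return records


# The partial call chain of the example (thresholds are hypothetical).
db_003 = make_node("db_003", 1, threshold=2)
local_15 = make_node("local_15", 3, threshold=5, failed=True, children=[db_003])
local_16 = make_node("local_16", 2, threshold=5)
csf_003 = make_node("csf_003", 20, threshold=30, failed=True, self_time=4,
                    children=[local_15, local_16])
csf_001 = make_node("csf_001", 22, threshold=10, children=[csf_003])
osb_001 = make_node("osb_001", 25, threshold=10, children=[csf_001])

records = search(osb_001, {})
```

With these inputs the resulting records match the final node information of the walkthrough, with local_16 and db_003 marked as suspected abnormal.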

Based on the information of the abnormal nodes, each root cause node in the abnormal nodes is determined.

Optionally, since a bottom-layer node that takes longer or whose call fails causes the upper-layer nodes to take longer or fail as well, the root cause nodes among the abnormal nodes can be determined by analyzing whether an abnormal node's long time consumption or call failure is caused by the abnormal node itself or by one of its descendant nodes.

If the long time consumption or call failure of an abnormal node is caused by the abnormal node itself, the abnormal node can be determined to be a root cause node; if it is caused by a descendant node of the abnormal node, the abnormal node is not determined to be a root cause node.

Unlike methods that generate an abnormality ID for abnormal event data and store the mapping between abnormal events and request IDs, the embodiment of the invention makes effective use of inherited node information, analyzes the abnormal state of each node through its child nodes, determines the root cause through the number of failures and the failure rate, and uses a pruning strategy to speed up the search, so that root cause nodes can be located more quickly and accurately. It also avoids the defect of traditional methods, in which fault propagation paths must be sorted out afterwards and the topology is difficult to optimize quickly.

Based on the content of any of the above embodiments, determining each root cause node among the abnormal nodes based on information of the abnormal nodes specifically includes: determining each possible root cause node among the abnormal nodes based on the depth and failure rate of each abnormal node in the call chain.

Optionally, let the number of call chains be $n_{\text{chain}}$, and for a node at a given layer (with depth $deep$) let the total number of calls be $n_{\text{node}}$, the number of faults be $m_{\text{fail}}$, and the number of suspected faults be $l_{\text{suspect}}$. The analysis may begin with the deepest abnormal node.

If $m_{\text{fail}} \ge 0.8 \times (deep - 2) \times 0.05 \times n_{\text{chain}}$ and the node failure rate $p_{\text{fail}}$ is greater than the failure rate threshold, the node is a possible root cause node.

Alternatively, if $m_{\text{fail}} \ge 0.8 \times (deep - 2) \times 0.1 \times n_{\text{chain}}$ and the node failure rate $p_{\text{fail}}$ is greater than the failure rate threshold, the node is a possible root cause node.

Here $p_{\text{fail}} = m_{\text{fail}} / n_{\text{node}}$.

The failure rate threshold may be set according to the actual situation. The embodiment of the present invention is not particularly limited with respect to the specific value of the failure rate threshold.

Preferably, the failure rate threshold defaults to 0.5.
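The possible-root-cause test can be written directly from the two conditions. The sketch below uses hypothetical parameter names; base_ratio stands for the 0.05 (or 0.1) factor, and the failure rate threshold defaults to 0.5.

```python
def is_possible_root_cause(failures, total_calls, depth, n_chains,
                           base_ratio=0.05, failure_rate_threshold=0.5):
    """Check the failure-count and failure-rate conditions for one node."""
    if total_calls == 0:
        return False
    min_failures = 0.8 * (depth - 2) * base_ratio * n_chains
    failure_rate = failures / total_calls
    return failures >= min_failures and failure_rate > failure_rate_threshold


# Hypothetical figures for a node at depth 5 with 102 call chains and ratio 0.1:
# the minimum failure count is 0.8 * 3 * 0.1 * 102 = 24.48.
frequent_failure = is_possible_root_cause(30, 40, depth=5, n_chains=102, base_ratio=0.1)
low_failure_rate = is_possible_root_cause(30, 100, depth=5, n_chains=102, base_ratio=0.1)
too_few_failures = is_possible_root_cause(10, 12, depth=5, n_chains=102, base_ratio=0.1)
```

Because the minimum failure count grows with depth, deeper nodes need proportionally more failures before they qualify.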

Each root cause node among the abnormal nodes is determined based on information of the child nodes of each possible root cause node.

Optionally, if a single possible root cause node exists at a layer, it is directly taken as the root cause node. If multiple possible root cause nodes exist at a layer, the one with the most failures is returned as the root cause node; if their numbers of failures are the same, the one with the higher failure rate is returned as the root cause node; and if both the number of failures and the failure rate are the same, the docker node is returned as the root cause node.
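The selection among several possible root cause nodes at one layer amounts to a three-level comparison. In the sketch below the candidate records and the rule for recognizing a docker node (a name starting with "docker") are assumptions made for illustration.

```python
def pick_root_cause(candidates):
    """Pick one node: most failures, then highest failure rate, then prefer a docker node."""
    def rank(c):
        return (c["failures"], c["failure_rate"], c["name"].startswith("docker"))
    return max(candidates, key=rank)


# Hypothetical candidates at one layer.
tied = [
    {"name": "local_15", "failures": 30, "failure_rate": 0.6},
    {"name": "docker_006", "failures": 30, "failure_rate": 0.6},
    {"name": "db_003", "failures": 20, "failure_rate": 0.9},
]
chosen = pick_root_cause(tied)
```

Here local_15 and docker_006 tie on both failures and failure rate, so the docker node is chosen; db_003 loses on the number of failures despite its higher failure rate.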

It should be noted that, if a service application call chain topology map exists in which service X calls the application nodes A, B, C and D, but in the actual search service X calls only the application nodes A, B and C, then node D is returned as a missing node.

The node information obtained from the call chain search statistics is integrated with the topological relation between service nodes and server nodes, finally yielding the abnormal-statistics node information of the server nodes (the depth is 2, the total number of node calls is 1, the suspected abnormal count is the number of suspected abnormal nodes, and the abnormal node count is 0).

For example, the number of call chains actually searched is 102, and the analysis starts from the deepest, fifth layer. At the fifth layer, the number of faults of default (indicating that the node does not exist) and of docker_006 meets 108×0.1×0.8×3, and the failure rate is greater than the failure rate threshold. Based on the principle that the deepest node is most likely to be the root cause, the analysis stops and the root cause is located directly at the docker_006 node of the fifth layer. (The data of the remaining layers satisfy the number-of-failures condition but not the failure-rate condition, which to some extent verifies that the true root cause node is the docker_006 node.)

According to the embodiment of the invention, by determining each possible root cause node among the abnormal nodes and determining each root cause node based on the information of the child nodes of each possible root cause node, the root cause nodes can be located more quickly and accurately.

Based on the content of any one of the foregoing embodiments, determining the root cause alarm among the alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation between the devices specifically includes: determining, based on a preset association rule, each target alarm pair and the association relation score between the alarms corresponding to the two root cause nodes included in each target alarm pair.

Optionally, based on a preset association rule, alarm pairs frequently occurring in alarms corresponding to each root cause node are determined. Each alarm pair includes two alarms.

The device refers to the device where the root cause node is located.

Based on a preset association rule, an association score between two alarms included in each alarm pair can be determined.

The association rules may be obtained from historical alert data and real-time alert data streams.
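The description does not state how the association rules or scores are computed. As one possible illustration only, the sketch below mines frequently co-occurring alarm pairs from historical events and scores each pair by a confidence-style ratio (co-occurrence count divided by the count of the rarer alarm). The scoring formula, the min_support parameter and the input layout are all assumptions, not the patented rule.

```python
from collections import Counter
from itertools import combinations


def mine_alarm_pairs(events, min_support=2):
    """Return {(alarm_a, alarm_b): score} for alarm types that co-occur often enough.

    events is a list of historical events, each a list of alarm types.
    """
    single = Counter()
    pair = Counter()
    for alarms in events:
        kinds = sorted(set(alarms))          # de-duplicate within one event
        single.update(kinds)
        pair.update(combinations(kinds, 2))  # every unordered pair in the event
    return {
        (a, b): n / min(single[a], single[b])
        for (a, b), n in pair.items()
        if n >= min_support
    }


# Hypothetical alarm history.
history = [["cpu", "disk"], ["cpu", "disk", "net"], ["cpu"], ["net"]]
pair_scores = mine_alarm_pairs(history)
```

Re-running the mining as new alarm data arrives is one way the rule table could be kept up to date.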

It should be noted that the primary-secondary rule table (i.e. the preset association rules) can be continuously updated according to the historical alarm data and the real-time alarm data stream, so as to mine new rules between alarms and reject wrong association rules. This improves the accuracy of determining the target alarm pairs and obtaining the association relation scores, so that the root cause alarm can be located more quickly and accurately.

Determining the root cause alarm among the alarms corresponding to the root cause nodes, based on the target alarm pairs and the association relation score between the two alarms included in each target alarm pair.

Optionally, based on the association relation score between the two alarms included in each alarm pair, a primary-secondary relationship between the alarms corresponding to the root cause nodes can be determined, so that the alarm that is primary relative to the others can be determined as the root cause alarm.

According to the embodiment of the invention, each target alarm pair and the association relation score between the two alarms included in each target alarm pair are determined based on the preset association rule, and the root cause alarm among the alarms corresponding to the root cause nodes is determined based on the target alarm pairs and these association relation scores, so that the root cause alarm can be located more quickly and accurately.

Based on the content of any one of the foregoing embodiments, determining the root cause alarm among the alarms corresponding to the root cause nodes, based on the target alarm pairs and the association relation score between the two alarms included in each target alarm pair, specifically includes: generating an undirected graph based on the target alarm pairs and the association relation score between the two alarms included in each target alarm pair.

Specifically, each vertex in the undirected graph represents an alarm corresponding to a root cause node; an edge in the undirected graph indicates that the alarms represented by the two vertices it connects form a target alarm pair; and the weight of the edge is the association relation score between those two alarms. The undirected graph can be generated accordingly. The undirected graph in the embodiments of the present invention may be referred to as an alarm graph.

Determining the root cause alarm among the alarms corresponding to the root cause nodes based on the maximum spanning tree of the undirected graph.

Specifically, the maximum spanning tree algorithm is similar to the minimum spanning tree algorithm. For a given undirected graph G(V, E) in which each edge (u, v) is assigned a weight w, a subset E' of the edge set E is selected such that all vertices in the vertex set V are connected by the edges in E' and the sum of the weights of all edges in E' is maximized.

Specifically, Kruskal's algorithm or Prim's algorithm may be used. Both are greedy algorithms. Taking Kruskal's algorithm as an example: each time, the edge (u, v) with the largest weight among the remaining edges is selected, and it is judged whether this edge forms a cycle with the edges already in E'. If no cycle is formed, (u, v) is added to E'; otherwise, the edge (u, v) is discarded. This step is repeated until all vertices in the graph are connected by the tree. The specific construction of the maximum spanning tree of an undirected graph with vertices 1-6 is shown in fig. 3.
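
The Kruskal procedure described above can be sketched as follows, using a union-find structure for the cycle check. The function name and the edge weights in the example are assumptions for illustration; they are not the values of fig. 3.

```python
def maximum_spanning_tree(vertices, edges):
    """edges: list of (u, v, weight) tuples of an undirected graph.
    Returns the chosen edges (a spanning forest if the graph is disconnected)."""
    parent = {v: v for v in vertices}

    def find(x):
        # Find the representative of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Greedy step: consider edges from the largest weight to the smallest.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((u, v, w))
            if len(tree) == len(vertices) - 1:
                break
    return tree


# Hypothetical alarm graph with association relation scores as weights.
vertices = [1, 2, 3, 4]
edges = [(1, 2, 0.9), (2, 3, 0.4), (1, 3, 0.7), (3, 4, 0.5), (2, 4, 0.2)]
tree = maximum_spanning_tree(vertices, edges)
```

In this example the edges (2, 3) and (2, 4) are never added, because the three heavier edges already connect all four vertices.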

After the maximum spanning tree of the undirected graph is obtained, the alarm represented by the vertex with an in-degree of 0 can be determined as the root cause alarm.
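
An in-degree is only defined once the tree edges have a direction, and the text does not state how that direction is assigned. The sketch below assumes each tree edge is oriented from the primary alarm to the secondary alarm of the pair, taken from a hypothetical lookup `primary_of` standing in for the primary-secondary rule table; the alarm names are also made up.

```python
def root_cause_alarms(vertices, tree_edges, primary_of):
    """tree_edges: (u, v, weight) edges of the maximum spanning tree.
    primary_of: maps frozenset({u, v}) to the primary alarm of that pair
    (assumed to come from the primary-secondary rule table).
    Returns the vertices whose in-degree is 0 after orienting every edge
    from primary to secondary."""
    in_degree = {v: 0 for v in vertices}
    for u, v, _weight in tree_edges:
        primary = primary_of[frozenset((u, v))]
        secondary = v if primary == u else u
        in_degree[secondary] += 1
    return [v for v in vertices if in_degree[v] == 0]


# Hypothetical alarms and primary-secondary relations.
alarm_vertices = ["disk_full", "db_write_fail", "api_timeout"]
alarm_tree = [
    ("disk_full", "db_write_fail", 0.9),
    ("db_write_fail", "api_timeout", 0.6),
]
primary_of = {
    frozenset(("disk_full", "db_write_fail")): "disk_full",
    frozenset(("db_write_fail", "api_timeout")): "db_write_fail",
}
roots = root_cause_alarms(alarm_vertices, alarm_tree, primary_of)
```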

According to the embodiment of the invention, the undirected graph is generated based on the target alarm pairs and the association relation score between the two alarms included in each target alarm pair, and the root cause alarm among the alarms corresponding to the root cause nodes is determined based on the maximum spanning tree of the undirected graph, so that the root cause alarm can be located more quickly and accurately.

Based on the content of any one of the foregoing embodiments, generating an undirected graph based on each target alarm pair and association scores between alarms corresponding to two root cause nodes included in each target alarm pair, specifically includes: based on the alarm time of the alarms corresponding to each root cause node, the alarms corresponding to each root cause node are divided into a plurality of target events.

Specifically, the high-frequency time of the same alarm and the association time between different alarms can be obtained based on the alarm time of the alarms corresponding to each root cause node.

The high-frequency time of the same alarm refers to the average interval between two or more occurrences of the same alarm. The high-frequency time is used for aggregating occurrences of the same alarm that are close in time during event aggregation.

The high-frequency time is weighted to take different devices and different time periods into account.

The method for calculating the high-frequency time is as follows:

First, a preliminary calculation of the high-frequency time is carried out per device according to the alarm start time and alarm end time, that is, overlapping time intervals are merged. Then, the difference between the first occurrence time and the last occurrence time of the alarm is taken. Finally, weighting is applied according to the number of times the alarm falls into the merged intervals and according to the device on which it occurs.
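
The first step, merging overlapping time intervals on one device, can be sketched as follows. The function name and the sample timestamps are assumptions; the later differencing and weighting steps are not shown because their exact form is not given here.

```python
def merge_intervals(intervals):
    """intervals: list of (start, end) alarm periods on one device.
    Returns the merged, non-overlapping periods in ascending order."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous period: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(period) for period in merged]


# Hypothetical alarm periods, in seconds.
periods = merge_intervals([(30, 40), (0, 10), (5, 20)])
```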

Taking the time periods of alarm A shown in fig. 4 as an example, the high-frequency times on device A1 are st3-st1 and st5-st4, so the number of high-frequency time periods is 2; the high-frequency time on device A2 is st4-st3, so the number of high-frequency time periods is 1.

The calculation formula of the high-frequency time of the alarm A is as follows:

Wherein 5/5 and 2/4 represent the weights of the device A1 and the device A2, respectively.

The association time between different alarms refers to the length of time within which a given alarm is judged to be associated with other alarms. The association time is used for aggregating two different alarms that are close in time during event aggregation.

The association time is weighted to take different devices and different time periods into account.

The method for calculating the association time is as follows:

First, a preliminary calculation of the association time is carried out per device according to the start time and the end time, that is, overlapping secondary alarm time intervals are merged. Then, the difference between the occurrence time of the secondary alarm and the occurrence time of the nearest preceding main alarm is taken. Finally, weighting is applied according to the number of times the alarm falls into the main alarm interval and according to the device on which it occurs.
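
The differencing step, pairing each secondary alarm with the nearest preceding main alarm, can be sketched as follows. The function name and the sample timestamps are assumptions, and the interval merging and weighting steps are left out.

```python
from bisect import bisect_right


def association_gaps(main_times, secondary_times):
    """For each secondary alarm, return the time elapsed since the nearest
    main alarm that occurred at or before it. Secondary alarms with no
    preceding main alarm are skipped."""
    mains = sorted(main_times)
    gaps = []
    for t in sorted(secondary_times):
        i = bisect_right(mains, t)  # number of main alarms at or before t
        if i:
            gaps.append(t - mains[i - 1])
    return gaps


# Hypothetical occurrence times, in seconds.
gaps = association_gaps(main_times=[100, 200], secondary_times=[90, 130, 260])
```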

Taking the alarms shown in fig. 5 as an example, the two alarms associated with alarm B on device A1 are B1 and B2, the association times are st1B-st4 and st3B-st5, and the number of associations is 2. The association times over all devices are weighted and averaged, so the result leans toward the devices on which the alarm occurs frequently.

According to the topological relation among the devices, the high-frequency time of the same alarm and the association time among different alarms, a plurality of alarms with association can be divided into the same target event, and a plurality of target events are determined.
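
A possible grouping procedure is sketched below with a union-find structure. The concrete association test is an assumption, since the text only names its three inputs: here two alarms are linked when they are on the same or directly connected devices and their time gap is within the high-frequency time (same alarm type) or the association time (different alarm types). All names and sample values are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Alarm:
    kind: str
    device: str
    time: float


def split_into_events(alarms, adjacent, high_freq_time, assoc_time):
    """adjacent: set of frozenset({device_a, device_b}) topology links.
    high_freq_time: {alarm kind: seconds}.
    assoc_time: {frozenset({kind_a, kind_b}): seconds}.
    Returns a list of events, each a list of associated alarms."""
    parent = list(range(len(alarms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(alarms)), 2):
        a, b = alarms[i], alarms[j]
        near = a.device == b.device or frozenset((a.device, b.device)) in adjacent
        if a.kind == b.kind:
            limit = high_freq_time.get(a.kind)
        else:
            limit = assoc_time.get(frozenset((a.kind, b.kind)))
        if near and limit is not None and abs(a.time - b.time) <= limit:
            parent[find(i)] = find(j)  # merge the two groups

    events = {}
    for i, alarm in enumerate(alarms):
        events.setdefault(find(i), []).append(alarm)
    return list(events.values())


alarms = [
    Alarm("disk_full", "A1", 0),
    Alarm("disk_full", "A1", 50),
    Alarm("db_write_fail", "A2", 70),
    Alarm("api_timeout", "A3", 1000),
]
events = split_into_events(
    alarms,
    adjacent={frozenset(("A1", "A2"))},
    high_freq_time={"disk_full": 60},
    assoc_time={frozenset(("disk_full", "db_write_fail")): 30},
)
```

Each resulting event can then be given its own alarm graph.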

It should be noted that the high-frequency time and the association time can also be continuously learned from historical data, so that the accuracy of event division improves over successive iterations.

An undirected graph is generated for each target event separately.

Specifically, for each target event, an undirected graph corresponding to the target event is generated based on the target alarm pairs included in the target event and the association relation score between the two alarms included in each target alarm pair.

The process of generating the undirected graph can be referred to the foregoing embodiments, and will not be described herein.

According to the embodiment of the invention, the alarms corresponding to the root cause nodes are divided into a plurality of target events based on the alarm time of those alarms, and the root cause alarm can be located more quickly and accurately for each target event.

The service fault root cause positioning device provided by the invention is described below. The service fault root cause positioning device described below and the service fault root cause positioning method described above may be read with reference to each other.

Fig. 6 is a schematic structural diagram of a service fault root cause positioning device provided by the invention. Based on the foregoing content of any one of the foregoing embodiments, as shown in fig. 6, the service fault root cause positioning device includes an index detection module 601, a node positioning module 602, and an alarm positioning module 603, where:

The index detection module 601 is configured to determine the time at which a target index of a target service is abnormal as the abnormal time;

The node positioning module 602 is configured to determine each root cause node on the call chain based on information of the call chain of the target service in a target time period that takes the abnormal time as the end time;

The alarm positioning module 603 is configured to determine root cause alarms in alarms corresponding to root cause nodes based on alarms corresponding to the root cause nodes and topology relationships between devices.

Specifically, the index detection module 601, the node positioning module 602, and the alarm positioning module 603 are electrically connected in sequence.

For a target service, the index detection module 601 may detect a target index of the service in a second target time period, and determine whether the target index of the service in the second target time period is normal.

Under the condition that the target index in the second target time period of the target service falls into the normal range, the index detection module 601 can determine that the target index of the target service is normal and no abnormality exists; in the case that the target index in the second target period of the target service does not fall within the normal range, the index detection module 601 may determine that the target index of the target service is abnormal.
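
A range check of this kind can be sketched as follows for the two target indexes named in this description, average delay and success rate. How the normal range is obtained is not stated in this passage, so the fixed thresholds, the function name, and the sample values are assumptions.

```python
def find_abnormal_times(samples, max_avg_delay_ms, min_success_rate):
    """samples: iterable of (timestamp, avg_delay_ms, success_rate).
    Returns the timestamps at which either index leaves its normal range."""
    return [
        timestamp
        for timestamp, avg_delay_ms, success_rate in samples
        if avg_delay_ms > max_avg_delay_ms or success_rate < min_success_rate
    ]


# Hypothetical per-minute samples of one target service.
samples = [
    (1, 120.0, 0.999),
    (2, 480.0, 0.998),  # delay outside the assumed normal range
    (3, 130.0, 0.900),  # success rate outside the assumed normal range
]
abnormal_times = find_abnormal_times(samples, max_avg_delay_ms=300.0, min_success_rate=0.99)
```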

The node positioning module 602 obtains information of the call chain of the target service in the target time period, analyzes the call time consumption of each node on each call chain, and determines each root cause node.

The alarm positioning module 603, based on the located root cause nodes, associates the topological relation of the platform architecture of the service system with each root cause node at a given abnormal time, further locates the alarms corresponding to each root cause node, analyzes the relation between the alarms corresponding to the root cause nodes, and finally locates the root cause alarm.

Optionally, the target index includes: average delay and success rate.

Optionally, the node positioning module 602 may include:

an abnormal node determining unit, configured to determine each abnormal node on the call chain based on the information of the call chain of the target service in the target time period; and

a root cause node determining unit, configured to determine each root cause node among the abnormal nodes based on the information of each abnormal node.

Optionally, the alarm positioning module 603 may include:

an association unit, configured to determine, based on a preset association rule, each target alarm pair and the association relation score between the two alarms included in each target alarm pair, each of the two alarms corresponding to a root cause node; and

a positioning unit, configured to determine the root cause alarm among the alarms corresponding to the root cause nodes based on the target alarm pairs and the association relation score between the two alarms included in each target alarm pair.

Optionally, the root cause node determining unit may include:

a possible root cause determining subunit, configured to determine each possible root cause node among the abnormal nodes based on the depth and the failure rate of each abnormal node in the call chain; and

a root cause node determining subunit, configured to determine each root cause node among the abnormal nodes based on the information of the child nodes of each possible root cause node.

Optionally, the positioning unit may include:

a graph generation subunit, configured to generate an undirected graph based on the target alarm pairs and the association relation score between the two alarms included in each target alarm pair; and

an alarm positioning subunit, configured to determine the root cause alarm among the alarms corresponding to the root cause nodes based on the maximum spanning tree of the undirected graph.

Optionally, the graph generation subunit is specifically configured to divide the alarms corresponding to the root cause nodes into a plurality of target events based on the alarm time of the alarms corresponding to the root cause nodes, and to generate an undirected graph for each target event separately.

The service fault root cause positioning device provided by the embodiment of the invention is used for executing the service fault root cause positioning method provided by the invention, the implementation mode of the service fault root cause positioning device is consistent with the implementation mode of the service fault root cause positioning method provided by the invention, and the same beneficial effects can be achieved, and the description is omitted here.

The service fault root cause positioning device is used for the service fault root cause positioning method of the previous embodiments. Therefore, the description and definition of the service fault root cause positioning method in the foregoing embodiments can be used for understanding each execution module in the embodiments of the present invention.

The electronic device and the storage medium provided by the invention are described below. The electronic device and the storage medium described below and the service fault root cause positioning method described above may be read with reference to each other.

Fig. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 7, the electronic device may include: a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may call a computer program in the memory 730 to perform the steps of the service fault root cause positioning method, including, for example: determining the time at which a target index of a target service is abnormal as the abnormal time; determining each root cause node on the call chain based on information of the call chain of the target service in a target time period that takes the abnormal time as the end time; and determining the root cause alarm among the alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices.

Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the steps of the service fault root cause positioning method provided by the above methods, for example comprising: determining the time at which a target index of a target service is abnormal as the abnormal time; determining each root cause node on the call chain based on information of the call chain of the target service in a target time period that takes the abnormal time as the end time; and determining the root cause alarm among the alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices.

In another aspect, embodiments of the present application further provide a processor-readable storage medium storing a computer program for causing the processor to execute the steps of the method provided in the foregoing embodiments, for example, including: determining the time at which a target index of a target service is abnormal as the abnormal time; determining each root cause node on the call chain based on information of the call chain of the target service in a target time period that takes the abnormal time as the end time; and determining the root cause alarm among the alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices.

The processor-readable storage medium may be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND flash), solid state disk (SSD)), etc.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A service fault root cause positioning method, characterized by comprising the following steps:

Determining root cause alarms in alarms corresponding to root cause nodes based on alarms corresponding to the root cause nodes and topological relations among devices;

the determining the root cause alarms in the alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices specifically comprises the following steps:

Determining root cause alarms in alarms corresponding to each root cause node based on the target alarm pairs and the association relation scores of the alarms corresponding to the two root cause nodes included in each target alarm pair;

The determining the root cause alarms in the alarms corresponding to the root cause nodes based on the target alarm pairs and the association relation scores between the two alarms corresponding to the root cause nodes included in each target alarm pair specifically comprises:

Determining root cause alarms in alarms corresponding to all root cause nodes based on the maximum spanning tree of the undirected graph;

generating an undirected graph based on the target alarm pairs and the association scores between the alarms corresponding to the root cause nodes included in each target alarm pair, wherein the undirected graph specifically comprises:

generating an undirected graph for each target event respectively;

the dividing the alarms corresponding to the root cause nodes into a plurality of target events based on the alarm time of the alarms corresponding to the root cause nodes comprises the following steps:

Obtaining the high-frequency time of the same alarm and the association time between different alarms based on the alarm time of the alarms corresponding to each root cause node, wherein the high-frequency time of the same alarm refers to the average interval between two or more occurrences of the same alarm, the high-frequency time of the same alarm is used for aggregating occurrences of the same alarm with similar occurrence times in event aggregation, the association time between different alarms refers to the association time of the alarm with other alarms, and the association time between different alarms is used for aggregating different alarms with similar occurrence times in event aggregation;

Dividing a plurality of alarms with relevance into the same target event according to the topological relation among the devices, the high-frequency time of the same alarm and the association time among different alarms, and determining a plurality of target events.

2. The service fault root cause positioning method of claim 1, wherein the target index comprises: average delay and success rate.

3. The service fault root cause positioning method according to claim 1, wherein the determining each root cause node on the call chain based on information of the call chain of the target service in a target time period with the abnormal time as an end time specifically includes:

4. The method for locating root causes of service faults according to claim 3, wherein the determining the root cause node in the abnormal nodes based on the information of the abnormal nodes specifically comprises:

5. A business fault root cause positioning device, comprising:

The alarm positioning module is used for determining root cause alarms in alarms corresponding to the root cause nodes based on the alarms corresponding to the root cause nodes and the topological relation among the devices;

generating an undirected graph for each target event respectively;

6. An electronic device comprising a processor and a memory storing a computer program, characterized in that the processor implements the steps of the service fault root cause localization method according to any one of claims 1 to 4 when the computer program is executed.

7. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program for causing the processor to execute the steps of the service fault root cause localization method according to any one of claims 1 to 4.