CN115150253B

CN115150253B - Fault root cause determining method and device and electronic equipment

Info

Publication number: CN115150253B
Application number: CN202210741173.3A
Authority: CN
Inventors: 张乐奇; 明旭
Original assignee: Hangzhou Ezviz Network Co Ltd
Current assignee: Hangzhou Ezviz Network Co Ltd
Priority date: 2022-06-27
Filing date: 2022-06-27
Publication date: 2024-03-08
Anticipated expiration: 2042-06-27
Also published as: CN115150253A

Abstract

The embodiments of the invention provide a fault root cause determining method and device and electronic equipment, applied to the technical field of cloud platform fault localization. The method comprises the following steps: when a cloud platform fault is detected on the cloud platform, determining a bottom layer fault service from at least one fault service, where a fault service is an application service that has failed, and the bottom layer fault service is the fault service at the lowest position in the call chain to which it belongs; determining a target architecture layer from the architecture layers of the cloud platform, where the target architecture layer is an architecture layer whose position is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed; determining, from the layer units contained in the target architecture layer, a target layer unit that implements the bottom layer fault service; and determining the fault root cause of the cloud platform fault based on the unit information of the target layer unit. With this scheme, the fault root cause can be determined quickly.

Description

Fault root cause determining method and device and electronic equipment

Technical Field

The invention relates to the technical field of cloud platform fault localization, and in particular to a fault root cause determining method and device and electronic equipment.

Background

With the development of cloud technology, cloud platforms are used in an ever wider range of scenarios, and more and more individuals and enterprises are willing to deploy their own services on cloud platforms. Stability is an important factor affecting the experience of using a cloud platform, so when a cloud platform fails, handling the fault quickly is particularly important.

Quickly determining the fault root cause of a cloud platform fault is key to improving fault handling efficiency. However, the service implementation of a cloud platform depends on microservices, so the topological relations between the application services in the cloud platform are complex. In addition, the cloud platform adopts a layered architecture; that is, the implementation of an application service depends on architecture units in multiple architecture layers, such as components in the middleware layer, hosts or containers in the operating system layer, and servers in the basic resource layer. As a result, when a cloud platform fault occurs, multiple application services and different architecture layers often fail at the same time, which makes it difficult to determine the fault root cause quickly.

Disclosure of Invention

The embodiment of the invention aims to provide a method and a device for determining a root cause of a fault and electronic equipment so as to rapidly determine the root cause of the fault. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present invention provides a method for determining a root cause of a fault, where the method includes:

when a cloud platform fault is detected on the cloud platform, determining a bottom layer fault service from at least one fault service; wherein a fault service is an application service that has failed, and the bottom layer fault service is the fault service at the lowest position in the call chain to which it belongs;

determining a target architecture layer from the architecture layers of the cloud platform; wherein the target architecture layer is an architecture layer whose position is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed;

determining, from the layer units contained in the target architecture layer, a layer unit that implements the bottom layer fault service as a target layer unit;

and determining the fault root cause of the cloud platform fault based on the unit information of the target layer unit.

Optionally, the determining a bottom layer fault service from at least one fault service includes:

determining a fault call chain based on the topological relations among the application services of the cloud platform; wherein the fault call chain is a call chain containing at least one fault service;

and determining, from the at least one fault service contained in the fault call chain, the fault service at the lowest position in the fault call chain as the bottom layer fault service.

Optionally, the determining a fault call chain based on the topological relation among the application services of the cloud platform includes:

determining upstream application services and downstream application services of each fault service based on topological relations among application services of the cloud platform;

and determining a call chain where each fault service is located as a fault call chain based on the upstream application service and the downstream application service of each fault service.

Optionally, after determining the bottom layer fault service from at least one fault service when a cloud platform fault is detected on the cloud platform, and before determining the target architecture layer from the architecture layers of the cloud platform, the method further includes:

determining the event operations of each application service in the call chain where the bottom layer fault service is located, within the window period in which the bottom layer fault service failed; wherein the event operations include at least one of a service publishing operation, a service change operation, and a service configuration operation;

performing a rollback operation for the event operations;

after the rollback operation is finished, determining whether the cloud platform fault has been eliminated;

and if not, executing the step of determining a target architecture layer from the architecture layers of the cloud platform.

Optionally, the determining a target architecture layer from the architecture layers of the cloud platform includes:

determining, among the architecture layers of the cloud platform, the lowest architecture layer that failed within the window period in which the bottom layer fault service failed, as the fault architecture layer;

and taking each architecture layer whose position is not lower than that of the fault architecture layer as a target architecture layer.

Optionally, the determining, among the architecture layers of the cloud platform, the lowest architecture layer that failed within the window period in which the bottom layer fault service failed, as the fault architecture layer, includes:

determining, among the architecture layers of the cloud platform, the architecture layers that failed within the window period in which the bottom layer fault service failed, as preselected architecture layers;

and determining, among the preselected architecture layers, the architecture layer at the lowest position as the fault architecture layer.

Optionally, the determining, among the architecture layers of the cloud platform, the architecture layers that failed within the window period in which the bottom layer fault service failed, as preselected architecture layers, includes:

determining the time period corresponding to the window period in which the bottom layer fault service failed as a fault time period;

and determining, among the architecture layers of the cloud platform, the architecture layers that failed within the fault time period as preselected architecture layers.

Optionally, the determining, based on the unit information of the target layer unit, a fault root cause of the cloud platform fault includes:

determining the unit information in an abnormal state in the unit information of the target layer unit as abnormal information;

and determining the fault root cause of the cloud platform fault based on the abnormal information.

Optionally, the unit information includes: index information and deployment information; the index information is information describing various index parameters of the layer unit, and the deployment information is information describing the deployment position of the layer unit.

Optionally, after the determining, based on the unit information of the target layer unit, a fault root cause of the cloud platform fault, the method further includes:

generating a visual report that shows the determined fault root cause.

Optionally, the visual report is further used for displaying at least one of a topology graph of each fault call chain, a layered architecture diagram of the cloud platform, and an information display area;

wherein a fault call chain is a call chain containing at least one fault service, the topology graph of each fault call chain highlights the fault services, the layered architecture diagram highlights each architecture layer that failed within the window period in which the bottom layer fault service failed, and the information display area is used for displaying the unit information of the target layer unit.

In a second aspect, an embodiment of the present invention provides a fault root cause determining apparatus, including:

a service determining module, configured to determine a bottom layer fault service from at least one fault service when a cloud platform fault is detected on the cloud platform; wherein a fault service is an application service that has failed, and the bottom layer fault service is the fault service at the lowest position in the call chain to which it belongs;

an architecture layer determining module, configured to determine a target architecture layer from the architecture layers of the cloud platform; wherein the target architecture layer is an architecture layer whose position is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed;

a unit determining module, configured to determine, from the layer units contained in the target architecture layer, a layer unit that implements the bottom layer fault service as a target layer unit;

and a root cause determining module, configured to determine the fault root cause of the cloud platform fault based on the unit information of the target layer unit.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of the first aspects when executing a program stored on a memory.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored therein, which when executed by a processor, implements the method steps of any of the first aspects.

The embodiment of the invention has the beneficial effects that:

according to the fault root cause determining method, device and electronic equipment provided by the embodiments of the invention, when a cloud platform fault occurs on the cloud platform, a bottom layer fault service can be determined from at least one fault service, a target architecture layer can be determined from the architecture layers of the cloud platform, a layer unit that implements the bottom layer fault service can be determined from the layer units contained in the target architecture layer as a target layer unit, and the fault root cause of the cloud platform fault can be determined based on the unit information of the target layer unit.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and those skilled in the art may obtain other embodiments from these drawings.

FIG. 1 is a schematic diagram of a call chain according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a layered architecture of a cloud platform according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for determining a root cause of a fault according to an embodiment of the present invention;

FIG. 4 is another schematic diagram of a call chain according to an embodiment of the present invention;

FIG. 5 is another schematic diagram of a layered architecture of a cloud platform according to an embodiment of the present invention;

FIG. 6 is another schematic diagram of a layered architecture of a cloud platform according to an embodiment of the present invention;

FIG. 7 is another flowchart of a method for determining a root cause of a fault according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of an event operation query according to an embodiment of the present invention;

FIG. 9 is another flow chart of a method for determining a root cause of a fault according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a visual report according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a fault root cause determining apparatus according to an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.

The service implementation of a cloud platform depends on microservices. In short, services exist in the cloud platform in the form of service flows: each service flow comprises a plurality of application services, the application services on each service flow form a call chain, and the application services on the same call chain work together to realize a service function.

As shown in FIG. 1, an embodiment of the present invention provides an exemplary call chain schematic diagram, which includes application service A, application service B, application service C, and application service D. Application service A receives a service request from the user side, performs the corresponding service processing, and sends its processing result to application service B and application service C. Application service B and application service C continue the processing and each send their processing results to application service D. Application service D continues the processing according to the processing results of application service B and application service C, so as to realize the service function requested by the service request.
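By way of illustration only, the call chain of FIG. 1 can be sketched as a directed graph in which each application service maps to the application services it sends results to. The names `CALL_GRAPH`, `upstream_of`, and `downstream_of` are hypothetical and are not part of the embodiment; the sketch only assumes the A to B/C to D topology described above.

```python
# Hypothetical adjacency map for the FIG. 1 call chain:
# each key sends its processing result to the services in its list.
CALL_GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}


def upstream_of(graph, service):
    """Return the services that directly call `service`, in sorted order."""
    return sorted(caller for caller, callees in graph.items() if service in callees)


def downstream_of(graph, service):
    """Return the services that `service` directly calls."""
    return list(graph.get(service, []))
```

With this map, `upstream_of(CALL_GRAPH, "D")` yields application services B and C, which matches the description that application service D processes the results of both.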

Meanwhile, the cloud platform adopts a layered architecture. As shown in FIG. 2, an embodiment of the invention provides a layered architecture schematic diagram of the cloud platform, which comprises, from top to bottom, an application service layer, a system layer, and a basic resource layer. The application service layer is the architecture layer where the application services in the cloud platform are located. The system layer is the architecture layer where the middleware/databases on which the application services depend, and the operating system, are located; it can be further subdivided into a middleware/database layer and an operating system layer. The basic resource layer is the architecture layer where basic resources (such as networks and switches) are located.

Because the service implementation of the cloud platform depends on microservices, the topological relations among the application services in the cloud platform are complex; and because the cloud platform adopts a layered architecture, when the cloud platform fails, multiple application services and different architecture layers often fail at the same time, as can be seen from the call chain and layered architecture examples above.

In order to quickly determine a fault root cause, the embodiment of the invention provides a fault root cause determination method, a device and electronic equipment.

It should be noted that, in a specific application, the embodiments of the present invention may be applied to various electronic devices, for example, personal computers, servers, mobile phones, and other devices having data processing capabilities. In addition, the fault root cause determining method provided by the embodiments of the invention can be implemented by software, hardware, or a combination of software and hardware.

In one embodiment, the embodiments of the invention can be applied to the cloud platform itself. In this case, the cloud platform monitors its own running state in real time, and when a cloud platform fault is detected, it can execute the fault root cause determining method provided by the embodiments of the invention to determine the fault root cause of the cloud platform fault.

In another embodiment, the embodiments of the invention can also be applied to electronic equipment independent of the cloud platform. In this case, the electronic equipment communicates with the cloud platform in real time so as to monitor the running state of the cloud platform, and if a cloud platform fault is detected, it executes the fault root cause determining method provided by the embodiments of the invention to determine the fault root cause of the cloud platform fault.

The method for determining the root cause of the fault provided by the embodiment of the invention can comprise the following steps:

when a cloud platform fault is detected on the cloud platform, determining a bottom layer fault service from at least one fault service; wherein a fault service is an application service that has failed, and the bottom layer fault service is the fault service at the lowest position in the call chain to which it belongs;

determining a target architecture layer from the architecture layers of the cloud platform; wherein the target architecture layer is an architecture layer whose position is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed;

determining, from the layer units contained in the target architecture layer, a layer unit that implements the bottom layer fault service as a target layer unit;

and determining the fault root cause of the cloud platform fault based on the unit information of the target layer unit.

According to the scheme provided by the embodiments of the invention, after the bottom layer fault service is determined, the target architecture layer whose position is not lower than that of the fault architecture layer is further determined, and the target layer unit that implements the bottom layer fault service is then determined from the layer units of the target architecture layer, so that the location of the fault root cause can be narrowed down quickly; the fault root cause of the cloud platform fault is then determined based on the unit information of the target layer unit.

The fault root cause determining method provided by the embodiments of the invention is described in detail below with reference to the accompanying drawings.

As shown in FIG. 3, an embodiment of the present invention provides a fault root cause determining method, including steps S301 to S304, in which:

S301, when a cloud platform fault is detected on the cloud platform, determining a bottom layer fault service from at least one fault service; wherein a fault service is an application service that has failed, and the bottom layer fault service is the fault service at the lowest position in the call chain to which it belongs;

The execution body of the embodiments of the invention can monitor in real time whether a cloud platform fault occurs on the cloud platform, and if a cloud platform fault occurs, the bottom layer fault service can be determined from at least one fault service.

As can be seen from the foregoing, when the cloud platform fails, multiple application services or architecture layers often fail at the same time; in that case, the bottom layer fault service can be determined from the multiple fault services. For example, FIG. 4 is another call chain schematic diagram provided by an embodiment of the present invention, in which the call chain on the left includes application services A, B, C, and D, and the call chain on the right includes application services E, F, and G. When the cloud platform fails, if application service B, application service D, and application service F fail, then application services B, D, and F are the fault services, and application service D and application service F are the bottom layer fault services.

To determine the bottom layer fault service from at least one fault service when a cloud platform fault is detected, various implementations may be adopted. By way of example, at least the following ways of determining the bottom layer fault service exist:

In the first way of determining the bottom layer fault service, when only one fault service exists, that fault service can be used directly as the bottom layer fault service.

In the second way of determining the bottom layer fault service, when there are multiple fault services, the fault services belonging to the same call chain may be placed in one group, and the most downstream fault service in each group is then determined as the bottom layer fault service.

In the third way of determining the bottom layer fault service, a fault call chain can be determined based on the topological relations among the application services of the cloud platform, and the fault service at the lowest position in the fault call chain is then determined, from the at least one fault service contained in the fault call chain, as the bottom layer fault service.

The fault call chain is a call chain containing at least one fault service. Optionally, the upstream application services and downstream application services of each fault service can be determined based on the topological relations among the application services of the cloud platform, and the call chain where each fault service is located can then be determined from those upstream and downstream application services and used as a fault call chain. For example, continuing with FIG. 4, application service B and application service D are fault services, and the topological relations among the application services of the cloud platform show that: the upstream application service of application service B is application service A, and its downstream application service is application service D; application service A has no upstream application service, and its downstream application services are application service B and application service C; the upstream application service of application service C is application service A, and its downstream application service is application service D; and the upstream application services of application service D are application service B and application service C. The fault call chain shown in FIG. 4 can thus be obtained.
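The ways of determining the bottom layer fault service described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the topology is assumed to be an acyclic adjacency map, the right-hand chain of FIG. 4 is assumed to be the linear chain E to F to G, and the names `FIG4_GRAPH`, `reachable_downstream`, and `bottom_fault_services` are hypothetical. A fault service is treated as a bottom layer fault service when no other fault service lies downstream of it.

```python
# Hypothetical topology for FIG. 4 (the E -> F -> G ordering is an assumption).
FIG4_GRAPH = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [],
    "E": ["F"], "F": ["G"], "G": [],
}


def reachable_downstream(graph, start):
    """Collect every service reachable downstream of `start` (assumes no cycles)."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def bottom_fault_services(graph, faulty):
    """Return the fault services that have no other fault service downstream."""
    faulty = set(faulty)
    return sorted(
        service for service in faulty
        if not (reachable_downstream(graph, service) & faulty)
    )
```

For the fault services B, D, and F of the FIG. 4 example, the sketch returns D and F: application service B is excluded because the faulty application service D lies downstream of it, while application service F is kept because its downstream application service G has not failed. When only one fault service exists, it is returned directly, which corresponds to the first way described above.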

S302, determining a target architecture layer from the architecture layers of the cloud platform; wherein the target architecture layer is an architecture layer whose position is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed;

Different cloud platforms may have different architecture layers. For example, FIG. 2 includes three architecture layers, namely an application service layer, a system layer, and a basic resource layer. Alternatively, the system layer can be further divided to obtain the layered architecture diagram shown in FIG. 5, which comprises, from top to bottom, an application service layer, a middleware/database layer, an operating system layer, and a basic resource layer. The application service layer is the architecture layer where layer units such as the application services in the cloud platform are located; the middleware/database layer is the architecture layer where layer units such as the various middleware on which the application services depend are located; the operating system layer is the architecture layer where layer units such as the hosts/containers on which the application services run are located; and the basic resource layer is the architecture layer where layer units such as the server hardware and network environment on which the application services run are located.

Generally, when a lower architecture layer fails, the higher architecture layers also fail. Therefore, in one implementation, each failed architecture layer may be used directly as a target architecture layer. Alternatively, in another way of determining the target architecture layer, steps A1 and A2 may be performed:

Step A1, determining, among the architecture layers of the cloud platform, the lowest architecture layer that failed within the window period in which the bottom layer fault service failed, as the fault architecture layer;

The window period in which the fault occurred can be a period of a designated length before the fault occurrence time, or a period of designated lengths before and after the fault occurrence time. Illustratively, when the bottom layer fault service fails at 9:00, the window period may be 8:45-10:15.
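As a small illustration of the window period, the sketch below derives the 8:45-10:15 window of the example from a 9:00 fault occurrence time, assuming a designated length of 15 minutes before and 75 minutes after the fault. The function names, the two durations as separate parameters, and the calendar date are assumptions made for the example only.

```python
from datetime import datetime, timedelta


def fault_window(fault_time, before, after=timedelta(0)):
    """Return (start, end) of the window period around a fault occurrence time."""
    return fault_time - before, fault_time + after


def failed_in_window(alarm_time, window):
    """Check whether an alarm timestamp falls inside the window period."""
    start, end = window
    return start <= alarm_time <= end


# The calendar date is arbitrary; only the time of day comes from the example.
fault_time = datetime(2022, 6, 27, 9, 0)
window = fault_window(
    fault_time, before=timedelta(minutes=15), after=timedelta(minutes=75)
)
```

Leaving `after` at its default of zero gives the first variant described above, in which the window period lies entirely before the fault occurrence time.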

For an architecture layer that failed within the window period in which the bottom layer fault service failed, an abnormality of the layer units in that architecture layer is often the cause of the failure of the bottom layer fault service. Therefore, the embodiments of the invention can determine, among the architecture layers of the cloud platform, the lowest architecture layer that failed within the window period in which the bottom layer fault service failed, as the fault architecture layer.

In one way of determining the fault architecture layer, the architecture layers that failed within the window period in which the bottom layer fault service failed can be determined as preselected architecture layers, and the architecture layer at the lowest position among the preselected architecture layers is then determined as the fault architecture layer. Optionally, the time period corresponding to the window period in which the bottom layer fault service failed may first be determined as a fault time period, and then, among the architecture layers of the cloud platform, the architecture layers that failed within the fault time period are determined as preselected architecture layers.

Step A2, taking each architecture layer whose position is not lower than that of the fault architecture layer as a target architecture layer.

After the fault architecture layer is determined, each architecture layer whose position is not lower than that of the fault architecture layer may be used as a target architecture layer. For example, FIG. 6 is a layered architecture schematic diagram of a cloud platform according to an embodiment of the present invention, which comprises, from top to bottom, an application service layer, a middleware/database layer, an operating system layer, and a basic resource layer. If the fault architecture layer is the middleware/database layer, it may be determined that both the application service layer and the middleware/database layer are target architecture layers.
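Steps A1 and A2 can be sketched as follows, assuming the architecture layers are held in a list ordered from top to bottom and the preselected (failed) architecture layers have already been identified for the window period. The names `LAYERS_TOP_DOWN` and `target_layers` are hypothetical.

```python
# Architecture layers of FIG. 6, ordered from top to bottom.
LAYERS_TOP_DOWN = [
    "application service",
    "middleware/database",
    "operating system",
    "basic resource",
]


def target_layers(layers_top_down, preselected):
    """Return (fault_layer, target_layers) for a set of preselected layers.

    Step A1: the fault layer is the lowest preselected layer.
    Step A2: the target layers are all layers not lower than the fault layer.
    """
    failed_positions = [
        index for index, layer in enumerate(layers_top_down) if layer in preselected
    ]
    if not failed_positions:
        return None, []
    fault_index = max(failed_positions)  # larger index means lower position
    return layers_top_down[fault_index], layers_top_down[: fault_index + 1]
```

For the FIG. 6 example, passing the preselected layers `{"application service", "middleware/database"}` gives the middleware/database layer as the fault architecture layer and the application service layer and middleware/database layer as the target architecture layers.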

S303, determining, from the layer units contained in the target architecture layer, a layer unit that implements the bottom layer fault service as a target layer unit;

After the target architecture layer is determined, the layer unit that implements the bottom layer fault service may be determined from the layer units contained in the target architecture layer as the target layer unit. Taking the architecture layers shown in FIG. 5 as an example, the layer units contained in the application service layer are application services, the layer units contained in the middleware/database layer are the various middleware, the layer units contained in the operating system layer are hosts, containers, and the like, and the layer units contained in the basic resource layer are server hardware, network environments, and the like.

When the application service layer is a target architecture layer, the bottom layer fault service itself can be used as the target layer unit.

When the middleware/database layer is a target architecture layer, the middleware in the middleware/database layer on which the implementation of the bottom layer fault service depends can be used as the target layer unit. Optionally, the alarms of the middleware/database layer within the alarm window period can be matched against the CMDB (Configuration Management Database) to find the middleware on which the bottom layer fault service depends, as the target layer unit.

When the operating system layer is a target architecture layer, the hosts and containers in the operating system layer on which the bottom layer fault service is deployed can be used as target layer units. Optionally, the alarms of the operating system layer within the same alarm window period are matched against the CMDB to find the host/container in the operating system layer on which the bottom layer fault service is deployed, as the target layer unit.

When the basic resource layer is a target architecture layer, the server hardware and the required network environment resources in the basic resource layer on which the bottom layer fault service is deployed can be used as target layer units. Optionally, the alarms of the basic resource layer within the same alarm window period are matched against the CMDB to find the server hardware and network environment resources in the basic resource layer on which the bottom layer fault service is deployed, as the target layer unit.
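The optional alarm-to-CMDB matching for the cases above can be sketched as follows. The structure of the CMDB record, the alarm tuples, and all unit names (`service-D`, `redis-cluster-1`, and so on) are hypothetical placeholders; a real CMDB exposes its own query interface, which is not modeled here. The alarms are assumed to have already been filtered to the alarm window period.

```python
# Hypothetical CMDB excerpt: service -> architecture layer -> units it relies on.
CMDB = {
    "service-D": {
        "middleware/database": ["redis-cluster-1", "mysql-main"],
        "operating system": ["host-17", "container-d-3"],
        "basic resource": ["server-rack2-05", "switch-2"],
    },
}


def target_layer_units(cmdb, bottom_service, target_layers, alarms):
    """Map each target layer to the units implementing the bottom fault service.

    `alarms` is a list of (layer, unit) pairs raised within the window period.
    For the application service layer the service itself is the target unit;
    for the other layers, only CMDB-listed units that also raised an alarm are kept.
    """
    dependencies = cmdb.get(bottom_service, {})
    result = {}
    for layer in target_layers:
        if layer == "application service":
            result[layer] = [bottom_service]
            continue
        alarmed = {unit for alarm_layer, unit in alarms if alarm_layer == layer}
        result[layer] = [u for u in dependencies.get(layer, []) if u in alarmed]
    return result
```

In this sketch an alarm on a unit that the CMDB does not associate with the bottom layer fault service is ignored, which reflects the intent of keeping only layer units that implement that service.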

S304, determining a fault root cause of the cloud platform fault based on the unit information of the target layer unit.

Because the target layer unit is a layer unit on which the bottom layer fault service depends, the fault root cause of the cloud platform fault can, with high probability, be narrowed down to an abnormality of the target layer unit. On this basis, after the target layer unit is determined, the fault root cause of the cloud platform fault can be determined based on the unit information of the target layer unit.

Optionally, unit information of the target layer unit may be collected first, and then, based on the collected unit information, a fault root cause of the cloud platform fault may be determined. The unit information may include: index information and deployment information; the index information is information describing various index parameters of the layer unit, and the deployment information is information describing the deployment position of the layer unit.

Illustratively, when the target layer unit is a layer unit in the application service layer, the index information includes at least one of the throughput, time consumption and error rate of the interface, the Full GC frequency, the number of thread stacks, and the like. Here, Full GC refers to a particular kind of garbage collection (GC) in which the memory of the whole heap is reclaimed, including the old generation, the young generation and the like. The deployment information includes at least one of clusters, machine rooms, nodes, and the like.

When the target layer unit is a layer unit in the middleware/database layer, the index information includes: at least one of cache hit rate, slow database queries, message queue accumulation, etc.; the deployment information includes: at least one of clusters, machine rooms, nodes, etc.

When the target layer unit is a layer unit in the operating system layer, the index information includes: at least one of CPU usage, memory usage, disk usage, etc.; the deployment information includes: at least one of a host, a container, etc.

When the target layer unit is a layer unit in the base resource layer, the index information includes: at least one of memory corruption, disk corruption, shared storage anomalies, and the like; the deployment information includes: at least one of switch abnormality, network packet loss rate, etc.

After the unit information of the target layer unit is obtained, the unit information that is in an abnormal state can be determined and used as the abnormal information, and the fault root cause of the cloud platform fault is then determined based on the abnormal information. Optionally, the abnormal unit corresponding to the abnormal information may be confirmed as the fault root cause.
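A minimal sketch of this step is given below. The description does not say how an "abnormal state" is decided, so the fixed thresholds in `THRESHOLDS`, the metric names, and the function `abnormal_info` are assumptions made purely for illustration.

```python
# Hypothetical thresholds; the description does not specify how abnormality
# is judged, so fixed upper limits are assumed here.
THRESHOLDS = {"cpu_usage": 0.90, "memory_usage": 0.90, "disk_usage": 0.95}


def abnormal_info(unit_info):
    """unit_info maps a unit name to {"metrics": {...}, "deployment": {...}}.
    Returns a (unit, metric, value) tuple for each metric above its limit."""
    findings = []
    for unit, info in unit_info.items():
        for metric, value in info["metrics"].items():
            limit = THRESHOLDS.get(metric)
            if limit is not None and value > limit:
                findings.append((unit, metric, value))
    return findings


# Illustrative unit information for two operating system layer units.
units = {
    "host-17": {
        "metrics": {"cpu_usage": 0.97, "memory_usage": 0.40},
        "deployment": {"host": "host-17"},
    },
    "container-a3": {
        "metrics": {"cpu_usage": 0.20, "disk_usage": 0.50},
        "deployment": {"host": "host-17"},
    },
}

findings = abnormal_info(units)
# The units that own abnormal information are the root cause candidates.
root_cause_units = sorted({unit for unit, _, _ in findings})
```

Here only `host-17` exceeds a threshold, so it is the single root cause candidate; the deployment information is carried along so that a report can state where the abnormal unit is located.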

As shown in fig. 7, an embodiment of the present invention provides a fault root cause determining method, including steps S701 to S707, in which:

S701, determining a bottom layer fault service from at least one fault service when it is monitored that a cloud platform fault occurs on the cloud platform; the fault service is an application service with faults, and the bottom layer fault service is the fault service at the lowest layer in the call chain to which the fault service belongs;

The specific implementation manner is the same as or similar to step S301; see the related description of step S301, which is not repeated here.

S702, determining event operation of each application service in a call chain where the underlying fault service is located in a window period when the underlying fault service is faulty; the event operation comprises at least one of a release service operation, a change service operation and a configuration service operation;

The release service operation is executed by the release system, and a record of each execution exists in the release system; the change service operation is executed by the change system, and a record of each execution exists in the change system; the configuration service operation is executed by the configuration system, and a record of each execution exists in the configuration system. Therefore, by querying the release system, the change system and the configuration system, the embodiment of the invention can determine the event operations performed on each application service in the call chain where the bottom layer fault service is located within the window period in which the bottom layer fault service failed.

As shown in the event operation query schematic diagram of fig. 8, for each application service, the release system records all release records for that application service, and each release record contains the release information and release time of the application service, so that the release service operations that fall within the window period in which the bottom layer fault service failed can be determined from the release time in each release record. Likewise, the change system records all change records related to the application service, and each change record contains the information of the changed application service and the change time, so that the change service operations that fall within the window period can be determined from the change time in each change record.

In this way, by associating multiple systems, the event operations performed on each application service in the call chain where the bottom layer fault service is located can be queried for the window period in which the bottom layer fault service failed.
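The query across the release, change and configuration systems can be sketched as a time-window filter over their records. The record layout (`system`, `service`, `time`), the service names and the function `events_in_window` are hypothetical; real systems would each expose their own query interface.

```python
from datetime import datetime


def events_in_window(records, chain_services, window_start, window_end):
    """Keep the records that concern a service in the call chain and whose
    time falls inside the window period, ordered by time."""
    selected = [
        r for r in records
        if r["service"] in chain_services and window_start <= r["time"] <= window_end
    ]
    return sorted(selected, key=lambda r: r["time"])


# Illustrative records merged from the three systems.
records = [
    {"system": "release", "service": "order", "time": datetime(2022, 6, 27, 10, 2)},
    {"system": "change", "service": "payment", "time": datetime(2022, 6, 27, 9, 0)},
    {"system": "config", "service": "payment", "time": datetime(2022, 6, 27, 10, 1)},
    {"system": "release", "service": "search", "time": datetime(2022, 6, 27, 10, 3)},
]

hits = events_in_window(
    records,
    chain_services={"order", "payment"},
    window_start=datetime(2022, 6, 27, 10, 0),
    window_end=datetime(2022, 6, 27, 10, 10),
)
```

In the sample data the 09:00 change falls outside the window and the `search` release concerns a service outside the call chain, so only the configuration operation on `payment` and the release of `order` remain.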

S703, executing rollback operation for the event operation;

After the event operation is queried, a rollback operation may be performed on the event operation.

S704, after the rollback operation is finished, determining whether the cloud platform fault is eliminated, if not, executing step S705, otherwise, finishing.

If the cloud platform fault was caused by the event operation, the cloud platform fault can be quickly eliminated by the rollback operation, and the flow can end. Otherwise, the cloud platform fault was not caused by the event operation, and the fault root cause determining flow needs to continue, that is, step S705 is performed.
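The control flow of steps S703 and S704 can be sketched as below. The three callbacks and the function name `rollback_then_continue` are hypothetical, and rolling back the newest event first is a design assumption of this sketch rather than something the description specifies.

```python
def rollback_then_continue(events, rollback, fault_eliminated, continue_analysis):
    """Roll back the queried event operations, then either stop (fault gone)
    or continue with the layer-based root cause analysis (S705 onwards).

    Returns (resolved_by_rollback, analysis_result)."""
    # Assumption: undo the most recent operation first.
    for event in sorted(events, key=lambda e: e["time"], reverse=True):
        rollback(event)
    if fault_eliminated():
        return True, None
    return False, continue_analysis()


# Illustrative run: the fault persists after rollback, so analysis continues.
rolled_back = []
outcome = rollback_then_continue(
    events=[{"id": "release-1", "time": 1}, {"id": "change-2", "time": 2}],
    rollback=lambda e: rolled_back.append(e["id"]),
    fault_eliminated=lambda: False,
    continue_analysis=lambda: "continue with S705",
)
```

Passing the checks as callbacks keeps the sketch independent of any particular release, change or monitoring system.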

S705, determining a target architecture layer from the architecture layers of the cloud platform; wherein the target architecture layer is an architecture layer, among the architecture layers, whose level is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed;

The specific implementation manner is the same as or similar to step S302; see the related description of step S302, which is not repeated here.
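The relationship between the fault architecture layer and the target architecture layer can be illustrated with a short sketch. The layer names follow the four layers named in the description; the list `LAYERS` and the two function names are illustrative choices, not part of the described method.

```python
# Architecture layers ordered from highest to lowest, following the four
# layers named in the description.
LAYERS = [
    "application service",
    "middleware/database",
    "operating system",
    "basic resource",
]


def fault_layer(failed_layers):
    """Lowest layer among those that failed in the window period.
    Raises ValueError if no layer failed."""
    return max(failed_layers, key=LAYERS.index)


def target_layers(failed_layers):
    """Every layer that is not lower than the fault architecture layer."""
    lowest = LAYERS.index(fault_layer(failed_layers))
    return LAYERS[: lowest + 1]


failed = {"application service", "operating system"}
print(fault_layer(failed))    # operating system
print(target_layers(failed))
```

With failures in the application service layer and the operating system layer, the operating system layer is the fault architecture layer, and the target architecture layers are that layer and the two layers above it; the basic resource layer is excluded because it is lower and reported no fault.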

S706, determining a layer unit for realizing the bottom layer fault service from layer units contained in a target architecture layer as a target layer unit;

The specific implementation manner is the same as or similar to step S303; see the related description of step S303, which is not repeated here.

S707, determining a fault root cause of the cloud platform fault based on the unit information of the target layer unit.

The specific implementation manner is the same as or similar to step S304, and is described in the related description of step S304, which is not repeated here.

As shown in fig. 9, an embodiment of the present invention provides a fault root cause determining method, including steps S901 to S905, in which:

S901, when it is monitored that a cloud platform fault occurs on the cloud platform, determining a bottom layer fault service from at least one fault service; the fault service is an application service with faults, and the bottom layer fault service is the fault service at the lowest layer in the call chain to which the fault service belongs;

S902, determining a target architecture layer from the architecture layers of the cloud platform; wherein the target architecture layer is an architecture layer, among the architecture layers, whose level is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed;

S903, determining a layer unit for realizing the bottom layer fault service from layer units contained in a target architecture layer as a target layer unit;

S904, determining a fault root cause of the cloud platform fault based on the unit information of the target layer unit.

S905, a visual report showing the determined root cause of the fault is generated.

After the fault root cause is determined, a visual report showing the determined fault root cause can be further generated, so that the visual report can be more conveniently provided for operation and maintenance personnel, and the operation and maintenance personnel can determine the fault root cause more intuitively and clearly.

Further, to assist the operation and maintenance personnel in judging the root cause, in one implementation manner the visual report is further used for displaying at least one of a topological graph of each fault call chain, a layered architecture diagram of the cloud platform, and an information display area. A fault call chain is a call chain containing at least one fault service; the topological graph of each fault call chain highlights the fault services; the layered architecture diagram highlights each architecture layer that failed within the window period in which the bottom layer fault service failed; and the information display area is used for displaying the unit information of the target layer unit. For example, fig. 10 is a schematic diagram of a visual report provided by an embodiment of the present invention, which may assist operation and maintenance personnel in root cause judgment.

Corresponding to the method for determining a root cause of a fault provided in the foregoing embodiment of the present invention, as shown in fig. 11, an embodiment of the present invention further provides a device for determining a root cause of a fault, where the device includes:

the service determining module 1101 is configured to determine an underlying fault service from at least one fault service when it is monitored that the cloud platform has a cloud platform fault; the fault service is an application service with faults, and the bottom fault service is a fault service at the lowest layer in the affiliated call chain;

an architecture layer determining module 1102, configured to determine a target architecture layer from the architecture layers of the cloud platform; wherein the target architecture layer is an architecture layer, among the architecture layers, whose level is not lower than that of the fault architecture layer, and the fault architecture layer is the lowest architecture layer among the architecture layers that failed within the window period in which the bottom layer fault service failed;

a unit determining module 1103, configured to determine, from among the layer units included in the target architecture layer, a layer unit for implementing the bottom layer fault service as a target layer unit;

and a root cause determining module 1104, configured to determine a fault root cause of the cloud platform fault based on the unit information of the target layer unit.

Optionally, the service determining module includes:

the call chain determining submodule is used for determining a fault call chain based on the topological relation among all application services of the cloud platform; the fault call chain is a call chain containing at least one fault service;

and the service determination submodule is used for determining the fault service with the lowest position in the fault call chain from at least one fault service contained in the fault call chain as the bottom fault service.

Optionally, the call chain determining submodule is specifically configured to determine an upstream application service and a downstream application service of each fault service based on a topological relation among application services of the cloud platform; and determining a call chain where each fault service is located as a fault call chain based on the upstream application service and the downstream application service of each fault service.
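One way to picture how the bottom layer fault service follows from the topological relation is the sketch below. It assumes the topology is available as a mapping from each service to the downstream services it calls; that representation, the service names and the function `bottom_fault_services` are hypothetical.

```python
def bottom_fault_services(calls, faulty):
    """calls maps a service to the downstream services it calls.
    A fault service is at the bottom of its fault call chain when no fault
    service is reachable downstream of it."""

    def has_faulty_downstream(service, seen):
        for callee in calls.get(service, []):
            if callee in seen:
                continue  # already visited; also guards against cycles
            seen.add(callee)
            if callee in faulty or has_faulty_downstream(callee, seen):
                return True
        return False

    return sorted(s for s in faulty if not has_faulty_downstream(s, {s}))


# Illustrative topology: gateway -> order -> {payment, inventory}, payment -> ledger.
calls = {
    "gateway": ["order"],
    "order": ["payment", "inventory"],
    "payment": ["ledger"],
}
faulty = {"gateway", "order", "payment"}
bottom = bottom_fault_services(calls, faulty)
```

In the sample topology `gateway` and `order` both have a faulty service downstream, while `payment` only calls the healthy `ledger`, so `payment` is the bottom layer fault service.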

Optionally, the apparatus further includes a fault elimination module, which is used for, after the service determination module determines the bottom layer fault service from the at least one fault service and before the architecture layer determination module determines the target architecture layer from the architecture layers of the cloud platform: determining the event operation of each application service in the call chain where the bottom layer fault service is located within the window period in which the bottom layer fault service failed; performing a rollback operation for the event operation; after the rollback operation is finished, determining whether the cloud platform fault has been eliminated; and if not, calling the architecture layer determining module to execute the step of determining a target architecture layer from the architecture layers of the cloud platform; wherein the event operation includes at least one of a release service operation, a change service operation, and a configuration service operation.

Optionally, the architecture layer determining module includes:

the first determining submodule is used for determining, among the architecture layers of the cloud platform, the lowest architecture layer that failed within the window period in which the bottom layer fault service failed, as the fault architecture layer;

and the second determining submodule is used for taking the architecture layer with the position not lower than the fault architecture layer in the architecture layers as a target architecture layer.

Optionally, the first determining sub-module includes:

the first determining unit is used for determining a failed architecture layer as a preselected architecture layer in a window period when the bottom layer fault service fails in all architecture layers of the cloud platform;

and the second determining unit is used for determining the architecture layer with the lowest layer position from the determined preselected architecture layers as the fault architecture layer.

Optionally, the first determining unit is specifically configured to determine a window period corresponding to a time period when the underlying fault service fails, as a fault time period; and determining the architecture layer with the fault in the fault time period as a preselected architecture layer in the architecture layers of the cloud platform.
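The selection of preselected architecture layers can be sketched as follows. The description does not state how the window period is derived from the failure time, so the symmetric five-minute margin, the alarm data layout and both function names are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def fault_time_period(fault_time, margin=timedelta(minutes=5)):
    """Assumed window: a symmetric margin around the time of the failure."""
    return fault_time - margin, fault_time + margin


def preselected_layers(layer_alarms, fault_time, margin=timedelta(minutes=5)):
    """layer_alarms maps a layer name to the times at which it raised alarms.
    A layer is preselected if any of its alarms falls inside the window."""
    start, end = fault_time_period(fault_time, margin)
    return {
        layer
        for layer, times in layer_alarms.items()
        if any(start <= t <= end for t in times)
    }


# Illustrative alarm times per layer.
alarms = {
    "application service": [datetime(2022, 6, 27, 10, 1)],
    "middleware/database": [datetime(2022, 6, 27, 8, 30)],
    "operating system": [datetime(2022, 6, 27, 9, 58)],
    "basic resource": [],
}
selected = preselected_layers(alarms, fault_time=datetime(2022, 6, 27, 10, 0))
```

With the sample alarms, the middleware/database alarm at 08:30 lies outside the window, so only the application service layer and the operating system layer are preselected; the lowest of these would then be taken as the fault architecture layer.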

Optionally, the root cause determining module is specifically configured to determine, as the anomaly information, unit information in an anomaly state in the unit information of the target layer unit; and determining the fault root cause of the cloud platform fault based on the abnormal information.

Optionally, the apparatus further includes: and the report generation module is used for generating a visual report showing the determined fault root cause after the root cause determination module determines the fault root cause of the cloud platform fault based on the unit information of the target layer unit.

Optionally, the visual report is further used for displaying at least one of a topological graph of each fault call chain, a layered architecture diagram of the cloud platform and an information display area; the fault call chains are call chains containing at least one fault service, the topological graph of each fault call chain highlights the fault services, the layered architecture diagram highlights each architecture layer that failed within the window period in which the bottom layer fault service failed, and the information display area is used for displaying the unit information of the target layer unit.

The embodiment of the invention also provides an electronic device, as shown in fig. 12, which comprises a processor 1201, a communication interface 1202, a memory 1203 and a communication bus 1204, wherein the processor 1201, the communication interface 1202 and the memory 1203 complete the communication with each other through the communication bus 1204,

a memory 1203 for storing a computer program;

the processor 1201, when executing the program stored in the memory 1203, performs the following steps:

The communication bus of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented in the figure by only one bold line, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the electronic device and other devices.

The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the steps of any of the above-mentioned fault root cause determination methods.

In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the fault root cause determining methods of the above embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, the apparatus, electronic device, computer readable storage medium and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, see the corresponding description of the method embodiments.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A fault root cause determining method, the method comprising:

when the cloud platform fault of the cloud platform is monitored, determining a bottom layer fault service from a plurality of fault services; the fault service is an application service with faults, and the bottom fault service is a fault service at the lowest layer in the affiliated call chain;

determining a target architecture layer from a plurality of architecture layers of the cloud platform;

determining a fault root cause of the cloud platform fault based on the unit information of the target layer unit;

the determining a target architecture layer from a plurality of architecture layers of the cloud platform comprises:

determining, among the plurality of architecture layers of the cloud platform, the lowest architecture layer that failed within the window period in which the bottom layer fault service failed, as the fault architecture layer;

and setting the architecture layer, among the plurality of architecture layers, that is not lower than the fault architecture layer as the target architecture layer.

2. The method of claim 1, wherein determining an underlying failure service from a plurality of failure services comprises:

determining a fault call chain based on the topological relation among all application services of the cloud platform; the fault call chain is a call chain comprising a plurality of fault services;

and determining the fault service with the lowest position in the fault call chain from a plurality of fault services contained in the fault call chain as an underlying fault service.

3. The method of claim 2, wherein determining a fault call chain based on a topological relation between application services of the cloud platform comprises:

and determining a call chain where each fault service is located as a fault call chain based on the upstream application service and the downstream application service of the plurality of fault services.

4. A method according to any one of claims 1-3, wherein after determining an underlying failure service from a plurality of failure services and before determining a target architecture layer from a plurality of architecture layers of the cloud platform when a cloud platform failure is detected, the method further comprises:

performing a rollback operation for the event operation;

and if not, executing the step of determining a target architecture layer from a plurality of architecture layers of the cloud platform.

5. The method according to claim 1, wherein determining, among the plurality of architecture layers of the cloud platform, an architecture layer having a lowest level and having a failure within a window period when the underlying failure service fails, as a failed architecture layer, comprises:

determining a failed architecture layer as a preselected architecture layer in a window period when the bottom layer fault service fails in a plurality of architecture layers of the cloud platform;

6. The method of claim 5, wherein determining a failed architecture layer of the plurality of architecture layers of the cloud platform within a window period when the underlying failure service fails, as a preselected architecture layer, comprises:

among the plurality of architecture layers of the cloud platform, an architecture layer that fails within the failure time period is determined as a preselected architecture layer.

7. A method according to any one of claims 1-3, wherein determining a fault root cause of the cloud platform fault based on the unit information of the target layer unit comprises:

8. The method of claim 7, wherein the unit information comprises: index information and deployment information; the index information is information describing various index parameters of the layer unit, and the deployment information is information describing the deployment position of the layer unit.

9. A method according to any one of claims 1-3, wherein after said determining a root cause of a failure of said cloud platform based on said target layer unit's unit information, said method further comprises:

a visual report is generated showing the determined root cause of the fault.

10. The method of claim 9, wherein the visual report is further used to present at least one of a topology map of each faulty call chain, a hierarchical structure map of the cloud platform, and an information presentation area;

the fault call chains are call chains containing a plurality of fault services, the topology map of each fault call chain highlights the fault services, the hierarchical structure map highlights a plurality of architecture layers that failed within the window period in which the bottom layer fault service failed, and the information presentation area is used for presenting the unit information of the target layer unit.

11. A fault root cause determining apparatus, the apparatus comprising:

the service determining module is used for determining a bottom layer fault service from a plurality of fault services when monitoring that the cloud platform has a cloud platform fault; the fault service is an application service with faults, and the bottom fault service is a fault service at the lowest layer in the affiliated call chain;

the architecture layer determining module is used for determining a target architecture layer from a plurality of architecture layers of the cloud platform;

The root cause determining module is used for determining a fault root cause of the cloud platform fault based on the unit information of the target layer unit;

the architecture layer determining module is specifically configured to:

12. The electronic equipment is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;

a memory for storing a computer program;

a processor for carrying out the method steps of any one of claims 1-10 when executing a program stored on a memory.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-10.