CN116094906A - Fault root cause location method and device

Info

Publication number: CN116094906A
Application number: CN202310002613.8A
Authority: CN
Inventors: 周云鹏; 曾永强; 颜学峰; 白姣姣; 李焱
Original assignee: Shenzhen Huawei Cloud Computing Technology Co ltd
Current assignee: Shenzhen Huawei Cloud Computing Technology Co ltd
Priority date: 2023-01-03
Filing date: 2023-01-03
Publication date: 2023-05-09
Anticipated expiration: 2043-01-03
Also published as: CN116094906B

Abstract

The application discloses a method and device for locating the root cause of a fault, belonging to the technical field of operation and maintenance. The method is applied to a management node that manages a service cluster, where the service cluster includes multiple service nodes that implement user services. The method includes: obtaining networking information of the multiple service nodes; when the service cluster fails, obtaining link states of the links between the multiple service nodes; and performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault. The method and device can locate the root cause of a fault quickly, improving the efficiency of root cause location.

Description

Fault root cause location method and device

Technical field

The present application relates to the technical field of operation and maintenance, and in particular to a method and device for locating the root cause of a fault.

Background

A service cluster implements users' services and provides services to users. A service cluster includes multiple service nodes. Service clusters are usually large, and the path between two service nodes may span multiple other service nodes. During operation and maintenance of a live network, if a device on a link fails, locating the root cause of the fault takes a great deal of time and manpower. Moreover, if locating the root cause takes too long, the impact on customers may widen.

At present, when a link between service nodes fails, the management node issues an alarm that indicates the source service node and the destination service node of the link. Operation and maintenance personnel then check the network devices connected to the source and destination service nodes one by one to determine the root cause of the fault.

However, this way of locating the root cause of a fault is very inefficient.

Summary of the invention

The present application provides a method and device for locating the root cause of a fault. The present application improves the efficiency of locating the root cause of a fault and reduces the probability that a fault affects services. The technical solutions provided by the present application are as follows:

In a first aspect, the present application provides a method for locating the root cause of a fault. The method is applied to a management node that manages a service cluster. The service cluster includes multiple service nodes, and the service nodes implement user services. The method includes: obtaining networking information of the multiple service nodes; when the service cluster fails, obtaining link states of the links between the multiple service nodes; and performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault.

In the method provided by the present application, performing aggregation analysis based on the link states and the networking information allows the root cause of a fault to be located automatically across the whole cluster. The root cause can be located quickly, which improves the efficiency of root cause location and reduces the probability that a fault affects services.

In one implementation, the multiple service nodes all access the network through access network devices located at the access layer. In this case, performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault includes: when the link state of the links between a target service node and other service nodes cannot be obtained, or when the link states from other service nodes indicate that the link to the target service node is broken, determining that the target service node is a candidate root cause; obtaining, based on the networking information, the first access network device to which the target service node is connected and the first service nodes connected to the first access network device, where the first service nodes are the service nodes, other than the target service node, that are connected to the first access network device; and when the link states indicate that the first service nodes include a normal node, determining that the target service node is the root cause of the fault.
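The node-level check above can be illustrated with a minimal Python sketch. It is not taken from the application: the function names, the `(reporter, peer) -> bool` representation of link states, and the decision to treat a single "link broken" report as sufficient are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: (reporting node, peer node) -> True if the link is up.
LinkStates = Dict[Tuple[str, str], bool]


def find_candidates(nodes: Set[str], link_states: LinkStates) -> Set[str]:
    """Return the service nodes that are candidate root causes."""
    candidates = set()
    for node in nodes:
        # Condition 1: no link state could be obtained from this node.
        has_reported = any(src == node for (src, _dst) in link_states)
        # Condition 2 (assumption: one report is enough): another node
        # reports that its link to this node is broken.
        reported_down = any(
            dst == node and not up for (_src, dst), up in link_states.items()
        )
        if not has_reported or reported_down:
            candidates.add(node)
    return candidates


def is_node_root_cause(
    target: str, candidates: Set[str], node_to_access: Dict[str, str]
) -> bool:
    """True if some other node on the same access device is normal."""
    access = node_to_access[target]
    first_nodes = {
        n for n, dev in node_to_access.items() if dev == access and n != target
    }
    # If there are no sibling nodes, this returns False and the analysis
    # would have to continue at a higher layer.
    return any(n not in candidates for n in first_nodes)
```

For example, if nodes `B` and `C` report normally but both see a broken link to `A`, and all three share one access device, only `A` is a candidate and the check attributes the fault to `A` itself rather than to the shared device.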

In addition, the access network devices access the network through aggregation network devices located at the aggregation layer. In this case, performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault further includes: when the link states indicate that all the first service nodes are candidate root causes, obtaining, based on the networking information, the first aggregation network device to which the first access network device is connected and the second service nodes connected to the other access network devices that are connected to the first aggregation network device, where the second service nodes are the service nodes, other than the target service node, that are connected to the first aggregation network device; and when the link states indicate that the second service nodes include a normal node, determining that the first access network device is the root cause of the fault.

Further, the aggregation network devices access the network through core network devices located at the core layer. In this case, performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault further includes: when the link states indicate that all the second service nodes are candidate root causes, obtaining, based on the networking information, the first core network device to which the first aggregation network device is connected and the third service nodes connected to the other aggregation network devices that are connected to the first core network device, where the third service nodes are the service nodes, other than the target service node, that are connected to the first core network device; and when the link states indicate that the third service nodes include a normal node, determining that the first aggregation network device is the root cause of the fault.

Otherwise, performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault further includes: when the link states indicate that all the third service nodes are candidate root causes, determining that the first core network device is the root cause of the fault.
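The layer-by-layer escalation described above (service node, then access device, then aggregation device, then core device) can be sketched as follows. This is an illustrative sketch under assumptions, not the application's implementation: the `Topology` class, its field names and `locate_root_cause` are hypothetical, and the set of candidate root causes is assumed to have been computed already.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Topology:
    """Hypothetical networking information for a three-layer network."""
    node_to_access: Dict[str, str]  # service node -> access device
    access_to_agg: Dict[str, str]   # access device -> aggregation device
    agg_to_core: Dict[str, str]     # aggregation device -> core device

    def nodes_under_access(self, access: str) -> Set[str]:
        return {n for n, a in self.node_to_access.items() if a == access}

    def nodes_under_agg(self, agg: str) -> Set[str]:
        return {
            n for n, a in self.node_to_access.items()
            if self.access_to_agg[a] == agg
        }

    def nodes_under_core(self, core: str) -> Set[str]:
        return {
            n for n, a in self.node_to_access.items()
            if self.agg_to_core[self.access_to_agg[a]] == core
        }


def locate_root_cause(target: str, candidates: Set[str], topo: Topology) -> str:
    """Escalate one layer at a time until a normal node is found."""
    access = topo.node_to_access[target]
    agg = topo.access_to_agg[access]
    core = topo.agg_to_core[agg]

    under_access = topo.nodes_under_access(access)
    under_agg = topo.nodes_under_agg(agg)
    under_core = topo.nodes_under_core(core)

    # First service nodes: siblings on the same access device.
    if any(n not in candidates for n in under_access - {target}):
        return target
    # Second service nodes: nodes behind the other access devices
    # of the same aggregation device.
    if any(n not in candidates for n in under_agg - under_access):
        return access
    # Third service nodes: nodes behind the other aggregation devices
    # of the same core device.
    if any(n not in candidates for n in under_core - under_agg):
        return agg
    return core
```

In a topology where `n1` and `n2` share access device `acc1`, and `n3` sits on `acc2` under the same aggregation device, a candidate set of `{n1, n2}` with `n3` normal yields `acc1` as the root cause.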

Optionally, a service node can detect the link states of the links between itself and the service nodes connected to it, and send those link states to the management node. In this case, obtaining the link states of the links between the multiple service nodes includes: receiving, from each service node, the link states of the links between that service node and other service nodes.

In one implementation, before obtaining the link states of the links between the multiple service nodes, the management node can first select representative service nodes in the service cluster, so that each service node in the cluster obtains the link state of the link between itself and each representative service node. The method may then further include: the management node determines multiple service nodes to be tested among the multiple service nodes and provides information about them to each service node, so that each service node obtains the link state of the link between itself and each service node to be tested. Correspondingly, the management node receiving, from each service node, the link states of the links between that service node and other service nodes includes: the management node receiving, from each service node, the link states of the links between that service node and each service node to be tested.

Because the management node selects the service nodes to be tested, each service node in the cluster only obtains the link state of the link between itself and each service node to be tested, rather than the link state of the link between itself and every other service node in the cluster. This reduces the overhead of obtaining link states, lowers the probability of a network storm in the service cluster, and lessens the impact of link state collection on cluster performance. The effect is especially pronounced in large service clusters.
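The saving can be made concrete with a small calculation. The following Python sketch is illustrative only; the function names and the cluster sizes are assumptions, and each directed probe from one node to another is counted once.

```python
def full_mesh_probes(n: int) -> int:
    """Probes needed if every node probes every other node."""
    return n * (n - 1)


def sampled_probes(n: int, k: int) -> int:
    """Probes needed if every node probes only the k nodes to be tested.

    The k nodes to be tested do not probe themselves, so the total is
    (n - k) * k + k * (k - 1) = k * (n - 1).
    """
    return k * (n - 1)


print(full_mesh_probes(1000))    # 999000
print(sampled_probes(1000, 20))  # 19980
```

Under these assumed sizes (1000 service nodes, 20 nodes to be tested), the number of probes drops by a factor of 50, that is, by the ratio n / k.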

Optionally, the overall network range of all the service nodes to be tested can cover the network range of the service cluster, where the network range is the transmission range reachable through the network. The service nodes to be tested then collectively cover all cabinets and all network segments of the network used by the service cluster, which ensures that the collected link information is comprehensive and that root cause location is accurate.
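The application does not state how the management node chooses the service nodes to be tested. One simple possibility, shown here purely as an assumed sketch, is a single pass that keeps a node whenever it adds a cabinet or network segment that is not yet covered. The `placement` mapping and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def select_nodes_to_test(placement: Dict[str, Tuple[str, str]]) -> List[str]:
    """Pick nodes so that every cabinet and every segment is covered.

    `placement` maps a service node to its (cabinet, network segment).
    """
    covered_cabinets = set()
    covered_segments = set()
    selected = []
    for node in sorted(placement):  # sorted for a deterministic result
        cabinet, segment = placement[node]
        if cabinet not in covered_cabinets or segment not in covered_segments:
            selected.append(node)
            covered_cabinets.add(cabinet)
            covered_segments.add(segment)
    return selected
```

Every cabinet and segment ends up covered, because a node is skipped only when both its cabinet and its segment are already represented. The result is not guaranteed to be the smallest possible covering set.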

In one implementation, the link state is reflected by one or more of the following: the connectivity state of the link and the transmission delay. The connectivity state indicates whether the link is up or down. When the connectivity state indicates that the link is down, the service nodes cannot use the link to transmit data. The transmission delay also reflects the state of the link: when the delay is too large (for example, when it exceeds an expected delay threshold), the link may be down, or it may be up but in poor condition. If a service node transmits data over such a link, the service node's timeliness requirements cannot be met, which can also cause the service cluster to fail. The transmission delay of a link can therefore be used to reflect its link state.
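A link state record that combines both indicators might look like the following sketch. The class, the field names and the 50 ms threshold are assumptions chosen for illustration; the application does not specify a threshold value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkState:
    """Hypothetical link state reported by a service node."""
    connected: bool                   # connectivity state: up or down
    delay_ms: Optional[float] = None  # transmission delay, if measured


def is_link_normal(state: LinkState, max_delay_ms: float = 50.0) -> bool:
    """Treat a link as abnormal if it is down or its delay is too large."""
    if not state.connected:
        return False
    if state.delay_ms is not None and state.delay_ms > max_delay_ms:
        return False
    return True
```

A link that is up but slower than the threshold is classified as abnormal, matching the observation that such a link cannot meet the service node's timeliness requirements.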

In a second aspect, the present application provides a device for locating the root cause of a fault. The device is applied to a management node that manages a service cluster. The service cluster includes multiple service nodes, and the service nodes implement user services. The device includes: an acquisition module, configured to obtain networking information of the multiple service nodes, and further configured to obtain, when the service cluster fails, link states of the links between the multiple service nodes; and a processing module, configured to perform aggregation analysis based on the link states and the networking information to determine the root cause of the fault.

Optionally, the multiple service nodes all access the network through access network devices located at the access layer. In this case, the processing module is specifically configured to: when the link state of the links between a target service node and other service nodes cannot be obtained, or when the link states from other service nodes indicate that the link to the target service node is broken, determine that the target service node is a candidate root cause; obtain, based on the networking information, the first access network device to which the target service node is connected and the first service nodes connected to the first access network device, where the first service nodes are the service nodes, other than the target service node, that are connected to the first access network device; and when the link states indicate that the first service nodes include a normal node, determine that the target service node is the root cause of the fault.

Optionally, the access network devices access the network through aggregation network devices located at the aggregation layer. In this case, the processing module is specifically configured to: when the link states indicate that all the first service nodes are candidate root causes, obtain, based on the networking information, the first aggregation network device to which the first access network device is connected and the second service nodes connected to the other access network devices that are connected to the first aggregation network device, where the second service nodes are the service nodes, other than the target service node, that are connected to the first aggregation network device; and when the link states indicate that the second service nodes include a normal node, determine that the first access network device is the root cause of the fault.

Optionally, the aggregation network devices access the network through core network devices located at the core layer. In this case, the processing module is specifically configured to: when the link states indicate that all the second service nodes are candidate root causes, obtain, based on the networking information, the first core network device to which the first aggregation network device is connected and the third service nodes connected to the other aggregation network devices that are connected to the first core network device, where the third service nodes are the service nodes, other than the target service node, that are connected to the first core network device; and when the link states indicate that the third service nodes include a normal node, determine that the first aggregation network device is the root cause of the fault.

Alternatively, when the link states indicate that all the third service nodes are candidate root causes, the processing module determines that the first core network device is the root cause of the fault.

Optionally, the acquisition module is specifically configured to: receive, from each service node, the link states of the links between that service node and other service nodes.

Optionally, the acquisition module is specifically configured to: determine multiple service nodes to be tested among the multiple service nodes; provide information about the multiple service nodes to be tested to each service node, so that each service node obtains the link state of the link between itself and each service node to be tested; and receive, from each service node, the link states of the links between that service node and each service node to be tested.

Optionally, the network range of the multiple service nodes to be tested covers the network range of the service cluster.

Optionally, the link state is reflected by one or more of the following: the connectivity state of the link and the transmission delay.

In a third aspect, the present application provides a computing device including a memory and a processor. The memory stores program instructions, and the processor runs the program instructions to perform the method provided in the first aspect of the present application or any possible implementation thereof.

In a fourth aspect, the present application provides a computer cluster including at least one computing device. Each computing device includes a processor and a memory, and the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the method provided in the first aspect of the present application or any possible implementation thereof.

In a fifth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium is a non-volatile computer-readable storage medium that includes program instructions. When the program instructions run on a computing device, the computing device performs the method provided in the first aspect of the present application or any possible implementation thereof.

In a sixth aspect, the present application provides a computer program product containing instructions. When the computer program product runs on a computer, the computer performs the method provided in the first aspect of the present application or any possible implementation thereof.

Description of the drawings

FIG. 1 is a schematic diagram of an implementation environment involved in a method for locating the root cause of a fault provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an implementation environment involved in another method for locating the root cause of a fault provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of the process of a method for locating the root cause of a fault provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an implementation environment involved in yet another method for locating the root cause of a fault provided by an embodiment of the present application;

FIG. 5 is a flowchart of another method for locating the root cause of a fault provided by an embodiment of the present application;

FIG. 6 is a flowchart, provided by an embodiment of the present application, of a management node performing aggregation analysis based on link states and networking information to determine the root cause of a fault;

FIG. 7 is a visual schematic diagram of a management node performing aggregation analysis based on link states and networking information, provided by an embodiment of the present application;

FIG. 8 is a visual schematic diagram of another management node performing aggregation analysis based on link states and networking information, provided by an embodiment of the present application;

FIG. 9 is a visual schematic diagram of yet another management node performing aggregation analysis based on link states and networking information, provided by an embodiment of the present application;

FIG. 10 is a visual schematic diagram of a further management node performing aggregation analysis based on link states and networking information, provided by an embodiment of the present application;

FIG. 11 is a schematic diagram of a device for locating the root cause of a fault provided by an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a computing device provided by an embodiment of the present application;

FIG. 13 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.

Detailed description

To make the purpose, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.

For ease of understanding, the technologies and background involved in the embodiments of the present application are explained first.

A service cluster implements users' services and provides services to users. A service cluster includes multiple service nodes and can be deployed in a distributed manner. For example, the service cluster may be a distributed storage cluster, which provides distributed storage services. A distributed storage service is provided by one or more storage pools, and each storage pool consists of one or more storage servers. The storage pool abstracts the hard disks on these storage servers into a unified storage space for users. Typically, a distributed storage cluster has a management platform that manages the cluster. Each storage server has a client installed, through which it communicates with the management platform. In the embodiments of the present application, the service cluster may also be a cloud service cluster.

Service clusters are usually large, and the path between two service nodes may span multiple other service nodes. During operation and maintenance, if a device on a link of the service cluster fails, locating the root cause of the fault takes a great deal of time and manpower. Moreover, if locating the root cause takes too long, the impact on customers may widen.

At present, when a link between service nodes fails, the management node issues an alarm that indicates the source service node and the destination service node of the link. Operation and maintenance personnel then check the network devices connected to the source and destination service nodes one by one to determine the root cause of the fault. For example, based on the network topology of the service cluster and the Internet Protocol (IP) addresses of the source and destination service nodes, they check devices such as the switches connected to the source server and the destination server. Here, network topology refers to the specific physical (that is, real) or logical (that is, virtual) arrangement of the members that make up a network.

An embodiment of the present application provides a method for locating the root cause of a fault. The method can be applied to a management node that manages a service cluster. The service cluster includes multiple service nodes, and the service nodes implement user services. The method includes: obtaining networking information of the multiple service nodes; when the service cluster fails, obtaining link states of the links between the multiple service nodes; and performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault.

In this method, performing aggregation analysis based on the link states and the networking information allows the root cause of a fault to be located automatically across the whole cluster. The root cause can be located quickly, which improves the efficiency of root cause location and reduces the probability that a fault affects services.

FIG. 1 is a schematic diagram of an implementation environment involved in a method for locating the root cause of a fault provided by an embodiment of the present application. As shown in FIG. 1, the implementation environment includes a management node 10 and a service cluster. The service cluster includes multiple service nodes 20, which implement user services. The management node 10 uses the method provided by the embodiments of the present application to locate the root cause of faults in the service cluster. For example, the management node 10 obtains the networking information of the service nodes 20 and, when the service cluster fails, obtains the link states of the links between the multiple service nodes 20 and performs aggregation analysis based on the link states and the networking information to determine the root cause of the fault.

The management node 10 can establish communication connections with the service nodes 20 in the service cluster. For example, the management node 10 and a service node 20 can establish a communication connection over a network. Optionally, the network may be a local area network, the Internet, or another network, which is not limited in the embodiments of the present application.

Optionally, the management node 10 may be a computing device with computing capability. In one approach, the management node 10 may be a server. The server may be a single server, a server cluster composed of several servers, or a cloud computing service center. A cloud computing service center hosts a large pool of basic resources owned by the cloud service provider, such as computing resources, storage resources and network resources. The cloud computing service center can use these resources to implement the method for locating the root cause of a fault provided by the embodiments of the present application.

When the service node 20 is implemented through a cloud computing service center, the functions of the method provided by the embodiments of the present application can be abstracted by the cloud service provider on a cloud platform into a fault root cause location cloud service. The cloud platform can use the resources in the cloud computing center to provide this cloud service to users. After purchasing the fault root cause location cloud service on the cloud platform, a user can use it to locate the root cause of faults in the user's service cluster.

Alternatively, when the service node 20 is implemented through a cloud computing service center, the functions of the method provided by the embodiments of the present application can also be offered as an add-on to other cloud services. For example, when the service cluster is a distributed storage cluster that provides a storage cloud service, the fault root cause location cloud service can be offered as an add-on to the storage cloud service. That is, the method is used to locate the root cause of faults in the storage cloud service, to safeguard its quality of service and improve user experience.

Optionally, the cloud platform may be a central cloud platform, an edge cloud platform, or a cloud platform that includes both a central cloud and an edge cloud; the embodiments of the present application do not specifically limit this. In the implementation environment provided by the embodiments of the present application, the service node 20 may also be implemented on a resource platform other than a cloud platform, which is likewise not specifically limited. In that case, the service node 20 is implemented with the resources of that platform and provides the related services to users.

In the embodiments of the present application, as shown in FIG. 2, the management node 10 may belong to the service cluster, that is, the service cluster includes the management node 10. In this case, the management node 10 not only locates the root cause of faults in the service cluster but also manages the service nodes 20 in it, for example by scheduling tasks for the service nodes 20. The management node 10 records the relevant information of all service nodes 20 in the service cluster.

Alternatively, as shown in FIG. 1, the management node 10 may not belong to the service cluster, that is, the service cluster does not include the management node 10. In one implementation, the management node 10 is an operation and maintenance (O&M) platform independent of the service cluster. For example, the O&M platform may manage an entire data center, while the service cluster is deployed in one availability zone (AZ) of that data center. In this case, the management node 10 can also perform root cause location for other clusters or systems. For example, when the functions of the fault root cause location method provided in the embodiments of the present application are offered as an add-on to other cloud services, the method can be used to locate the root cause of faults in all cloud services provided by the cloud platform, because the cloud platform can provide multiple cloud services.

Optionally, when performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault, the management node 10 may also visualize the process. As shown in FIG. 3, after the management node 10 obtains the networking information of the multiple service nodes 20 and the link states of the links between them, it can display the progress of the aggregation analysis in real time on its display interface for the user to view.

In addition, the service nodes 20 in a service cluster are usually connected through a network. When the service cluster fails, the fault may lie in a service node 20 or in a link between service nodes 20. Accordingly, as shown in FIG. 4, the implementation scenario may further include an access switch 30, an aggregation switch 40, and a core switch 50. It should be noted that FIG. 4 is only one example of how the service nodes 20 may be connected. The connection mode between the service nodes 20 may vary with the application requirements of different scenarios, and the embodiments of the present application do not limit it. For example, the network devices between the service nodes 20 may include not only switches but also routers or optical network terminations (ONT, also called optical modems). As another example, the access switch 30 may be replaced by another type of access network device located at the access layer, the aggregation switch 40 by another type of aggregation network device located at the aggregation layer, and the core switch 50 by another type of core network device located at the core layer.

It should be understood that the above is an exemplary description of the application scenarios of the fault root cause location method provided in the embodiments of the present application and does not limit those scenarios. A person of ordinary skill in the art will appreciate that the application scenarios can be adjusted as business requirements change; the embodiments of the present application do not enumerate them one by one.

The fault root cause location method provided in the embodiments of the present application is described below. As shown in FIG. 5, the method includes the following steps:

Step 501: The management node acquires the networking information of the multiple service nodes.

The networking information of a service node indicates the networking status and networking mode of the service node. In one implementation, the networking information of the service nodes is represented by the network topology of the service nodes.
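By way of illustration, networking information expressed as a network topology may be sketched as a mapping from each device to its uplink device. The device names, the single-uplink assumption, and the helper functions below are hypothetical and are not taken from the present application.

```python
# Hypothetical topology: every device has exactly one uplink device (an assumption).
UPLINK = {
    "node-a": "access-1", "node-b": "access-1",
    "node-c": "access-2",
    "access-1": "agg-1", "access-2": "agg-1",
    "agg-1": "core-1",
}
SERVICE_NODES = {"node-a", "node-b", "node-c"}


def path_to_core(device, uplink):
    """Return the chain of uplink devices above `device`, nearest first."""
    path = []
    current = uplink.get(device)
    while current is not None:
        path.append(current)
        current = uplink.get(current)
    return path


def service_nodes_under(device, uplink, service_nodes):
    """Return the service nodes whose uplink chain passes through `device`."""
    return {n for n in service_nodes if device in path_to_core(n, uplink)}
```

With such a structure, the management node can look up which access, aggregation, and core devices serve a given service node, and which other service nodes share those devices.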

Step 502: When the service cluster fails, the management node acquires the link states of the links between the multiple service nodes.

The link state of a link between service nodes affects data transmission between those service nodes. When the link is normal, the service nodes can transmit data normally; when the link is abnormal, data transmission between the service nodes is affected. In one implementation, the link state is reflected by one or more of the following: the connectivity state of the link and its transmission delay. The connectivity state indicates whether the link is up or down. When the connectivity state indicates that the link is down, the service nodes cannot use the link to transmit data. The transmission delay also reflects the state of the link: when the transmission delay is too large (for example, when it exceeds the expected transmission delay threshold), the link may be down, or it may be up but in poor condition. If a service node transmits data over such a link, it cannot meet its service timeliness requirements, which also causes the service cluster to fail. The transmission delay of a link can therefore be used to reflect its link state. When a service node has multiple ports, the link state of the link between service nodes is the link state of the link between the ports the service nodes currently use. For example, if a service node has a primary port and a standby port and currently uses the primary port to carry services, the link state of the link between the service nodes is the link state of the link between their primary ports.
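By way of illustration, a link state that combines the connectivity state and the transmission delay may be sketched as follows. The record layout, the field names, and the 50 ms default threshold are assumptions made for this example; the present application does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkState:
    """Hypothetical link-state record reported for one link."""
    source: str
    target: str
    connected: bool                   # connectivity state: True means the link is up
    delay_ms: Optional[float] = None  # transmission delay; None when not measured


def is_link_normal(state, max_delay_ms=50.0):
    """Treat a link as normal only if it is up and its delay is within the threshold.

    The 50 ms default is an illustrative value, not one given in the source.
    """
    if not state.connected:
        return False
    if state.delay_ms is not None and state.delay_ms > max_delay_ms:
        return False
    return True
```

A link that is up but slower than the threshold is classified as abnormal here, matching the observation that an excessive delay can cause a fault even when the link is connected.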

Optionally, acquiring the link states of the links between the multiple service nodes includes: receiving, from each service node, the link states of the links between that service node and other service nodes. A service node can detect the link state of the link between itself and each service node connected to it and send that link state to the management node. For example, a service node may use the ping command to detect the link state of the link between itself and another service node. Alternatively, the service node may detect the link state in other ways, which the embodiments of the present application do not specifically limit.
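By way of illustration, a ping-based probe on a service node may be sketched as follows. The sketch assumes a Linux host with the iputils `ping` command (where `-c 1` sends one echo request and `-W 1` waits at most one second for a reply); the function names and the returned record layout are hypothetical.

```python
import re
import subprocess

# Matches the round-trip time in typical ping output, e.g. "time=0.412 ms".
_TIME_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def run_ping(host):
    """Send one echo request. Assumes Linux iputils ping option semantics."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        capture_output=True,
        text=True,
    )


def probe_link(host, runner=run_ping):
    """Return a hypothetical link-state record for the link to `host`.

    `runner` is injectable so that the parsing logic can be exercised
    without sending real network traffic.
    """
    result = runner(host)
    connected = result.returncode == 0
    match = _TIME_RE.search(result.stdout or "")
    delay_ms = float(match.group(1)) if (connected and match) else None
    return {"target": host, "connected": connected, "delay_ms": delay_ms}
```

The record covers both indicators mentioned above: the exit status of ping gives the connectivity state, and the reported round-trip time gives a delay measurement.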

In one implementation, before acquiring the link states of the links between the multiple service nodes, the management node may first select representative service nodes in the service cluster, so that each service node in the cluster obtains the link state of the link between itself and each representative service node. The method may then further include: the management node determines multiple service nodes to be tested among the multiple service nodes and provides each service node with the information of the service nodes to be tested, so that each service node obtains the link state of the link between itself and each service node to be tested. Correspondingly, receiving from each service node the link states of the links between that service node and other service nodes includes: the management node receives, from each service node, the link state of the link between that service node and each service node to be tested. In addition to providing the information of the service nodes to be tested, the management node may send each service node an instruction that directs it to obtain the link state of the link between itself and each service node to be tested. Alternatively, the management node may omit this instruction and convey the directive in another way. For example, the management node and the service nodes may agree in advance that whenever the management node provides the information of the service nodes to be tested, a service node must obtain the link state of the link between itself and each service node to be tested and send the obtained link states to the management node.

A service node can provide the management node with the link state of the link between itself and each service node to be tested in several ways. For example, after obtaining the link states, the service node may write them to a log, and the management node may read the link states from the log. Alternatively, the implementation scenario of the fault root cause location method provided in the embodiments of the present application may further include a collection node that reads the link states from the log and provides them to the management node. In addition, after obtaining the information that indicates the link states, the management node may perform operations such as data cleaning (for example, format conversion) on that information so that it can parse the information.
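By way of illustration, the format conversion performed during data cleaning may be sketched as follows. The log line format (space-separated `key=value` fields named `src`, `dst`, `status`, and `delay_ms`) is entirely hypothetical; the present application does not define a log format.

```python
def parse_link_log_line(line):
    """Convert one hypothetical log line into a normalized link-state record.

    Example input (assumed format):
        2023-01-03T10:00:00 src=node-a dst=node-b status=up delay_ms=0.41
    Returns None for lines that lack the required fields.
    """
    fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
    if not {"src", "dst", "status"} <= fields.keys():
        return None
    delay = fields.get("delay_ms")
    return {
        "source": fields["src"],
        "target": fields["dst"],
        "connected": fields["status"].lower() == "up",
        "delay_ms": float(delay) if delay not in (None, "", "-") else None,
    }
```

Lines that do not describe a link, such as unrelated log messages, are dropped by returning `None`, so the management node or a collection node can filter them out before analysis.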

When the service cluster is large, having every service node obtain the link state for every other service node in the cluster can cause a network storm in the cluster and degrade its performance. When the management node instead selects service nodes to be tested, each service node only obtains the link state of the link between itself and each service node to be tested, and no longer needs to obtain the link state for every other service node in the cluster. This reduces the overhead of obtaining link states, lowers the probability of a network storm in the service cluster, and reduces the impact of link-state collection on cluster performance. The benefit is especially pronounced when the service cluster is large.
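The saving can be made concrete with a short worked example. The sketch below counts directed probes for a full mesh and for a set of service nodes to be tested, and shows one possible way to pick representatives. The selection rule (a fixed number of nodes per access device) and all names are assumptions; the present application does not prescribe how representative nodes are chosen.

```python
def full_mesh_probe_count(n):
    """Every one of n service nodes probes every other service node."""
    return n * (n - 1)


def sampled_probe_count(n, k):
    """Every one of n service nodes probes each of k nodes to be tested.

    The k nodes to be tested do not probe themselves, hence k * (n - 1).
    """
    return k * (n - 1)


def pick_nodes_to_test(access_of, per_access=1):
    """Pick `per_access` service nodes under each access device (assumed rule).

    `access_of` maps a service node name to its access device name.
    """
    by_access = {}
    for node in sorted(access_of):
        by_access.setdefault(access_of[node], []).append(node)
    picked = []
    for access in sorted(by_access):
        picked.extend(by_access[access][:per_access])
    return picked
```

For a cluster of 1000 service nodes, a full mesh requires 999000 probes, whereas 10 service nodes to be tested require 9990, a reduction by a factor of 100.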

Step 503: The management node performs aggregation analysis based on the link states and the networking information and determines the root cause of the fault.

In the network, every service node in the service cluster accesses the network through an access network device at the access layer, each access network device accesses the network through an aggregation network device at the aggregation layer, and each aggregation network device accesses the network through a core network device at the core layer. When determining the root cause of a fault, aggregation analysis can be performed over the network devices that provide network connectivity to the service nodes. In one implementation, as shown in FIG. 6, performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault may include:

Step 5031: When the link state of the links between a target service node and other service nodes cannot be obtained, or when the link states reported by other service nodes indicate that the links to the target service node are down, the management node determines that the target service node is a candidate root cause.

Being unable to obtain the link state of the links between the target service node and other service nodes means that the management node has not received information indicating those link states. This may be because the target service node is disconnected from the network, or because the target service node cannot collect link information about its links to other service nodes. In either case, the target service node cannot transmit over the network. However, this may be caused by a problem in the target service node itself or by a problem in the network to which it is connected. The target service node is therefore first marked as a candidate root cause, and further analysis determines whether it is the actual root cause.
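By way of illustration, step 5031 may be sketched as follows. The report layout (`reports[reporter][peer]` is `True` when the reporter sees its link to the peer as up) is hypothetical. The sketch also assumes that a node counts as "reported down" only when every peer that reported on it saw the link as down; the present application does not state how many peer reports are required.

```python
def find_candidate_root_causes(service_nodes, reports):
    """Return the service nodes that step 5031 would mark as candidates.

    A node is a candidate if the management node received no link states
    from it, or if all peers that reported on it saw the link as down
    (the "all peers" rule is an assumption of this sketch).
    """
    candidates = set()
    for node in service_nodes:
        no_report = not reports.get(node)
        peer_views = [
            view[node]
            for reporter, view in reports.items()
            if reporter != node and node in view
        ]
        reported_down = bool(peer_views) and not any(peer_views)
        if no_report or reported_down:
            candidates.add(node)
    return candidates
```

Candidates produced this way are only suspects: whether a candidate is the actual root cause depends on the state of the other service nodes that share its network devices, as analyzed in the following steps.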

Step 5032: Based on the networking information, the management node obtains the first access network device to which the target service node is connected and the first service nodes connected to the first access network device, where the first service nodes are the service nodes connected to the first access network device other than the target service node.

After the target service node is determined to be a candidate root cause, the state of the network devices that provide network connectivity to it can be used to decide whether it is the actual root cause. Because every service node in the service cluster accesses the network through an access network device at the access layer, the decision can start from the link states of the service nodes connected to the first access network device, that is, the access network device to which the target service node is connected. To this end, the first access network device and the first service nodes connected to it are obtained first. In one implementation, they are determined from the networking information of the service nodes in the service cluster.

Step 5033: When the link states indicate that the first service nodes include a normal node, the management node determines that the target service node is the root cause of the fault.

After the first access network device and the first service nodes connected to it are determined, the link states of all first service nodes can be obtained, and whether the link of each first service node is normal can be determined from its link state. If the link of a first service node is normal, that service node can transmit data over the link and is a normal node. If the first access network device were the root cause, none of the first service nodes connected to it could work normally. Therefore, when the link states indicate that at least one of the first service nodes is a normal node, the first access network device can be determined to be in a normal state, which means the problem lies in the target service node itself. The target service node is then determined to be the root cause of the fault, and the fault in the service cluster is referred to as a service node fault.

Optionally, when performing aggregation analysis based on the link states and the networking information to determine the root cause of the fault, the management node may also visualize the process. In one implementation, the management node visualizes the process at cluster granularity. For example, as shown in FIG. 7, in step 5033 the current candidate root cause is the target service node, so the management node can display, based on the topology diagram, at least the target service node, the access network device to which it is connected (the access switch 30 in FIG. 7), and all first service nodes connected through that access network device. In FIG. 7, dots filled in black are service nodes determined to be candidate root causes, and dots filled in white are service nodes not determined to be candidate root causes. FIG. 7 shows that the first service nodes connected to the first access network device include normal nodes, so the target service node can be determined to be the root cause of the fault.

Step 5034: When the link states indicate that all first service nodes are candidate root causes, the management node obtains, based on the networking information, the first aggregation network device to which the first access network device is connected and the second service nodes connected to the other access network devices attached to the first aggregation network device, where the second service nodes are the service nodes connected to the first aggregation network device other than the target service node.

The probability that all service nodes connected to the same access network device fail at the same time is low. Therefore, when the link states indicate that all first service nodes connected to the first access network device are candidate root causes, it can be determined that the fault lies in the first access network device or in a network device that provides network connectivity to it. The management node can then obtain, based on the networking information, the first aggregation network device to which the first access network device is connected and the second service nodes connected to the other access network devices attached to the first aggregation network device, and use them to determine the root cause of the fault.

Step 5035: When the link states indicate that the second service nodes include a normal node, the management node determines that the first access network device is the root cause of the fault.

After the first aggregation network device and all second service nodes connected to it are determined, the link states of all second service nodes can be obtained, and whether the link of each second service node is normal can be determined from its link state. If the link of a second service node is normal, that service node can transmit data over the link and is a normal node. If the first aggregation network device were the root cause, none of the second service nodes connected to it could work normally. Therefore, when the link states indicate that at least one of the second service nodes is a normal node, the first aggregation network device can be determined to be in a normal state, which means the problem lies in the first access network device itself. The first access network device is then determined to be the root cause of the fault, and the fault in the service cluster is referred to as an access network device fault.

As shown in FIG. 8, in step 5035 the current candidate root cause is the first access network device (the access switch 30 in FIG. 8), so the management node can display, based on the topology diagram, at least the first access network device, the service nodes connected to it, the first aggregation network device (the aggregation switch 40 in FIG. 8), and all second service nodes connected through the first aggregation network device. In FIG. 8, dots filled in black are service nodes determined to be candidate root causes, and dots filled in white are service nodes not determined to be candidate root causes. FIG. 8 shows that the second service nodes connected to the first aggregation network device include normal nodes, so the first access network device can be determined to be the root cause of the fault.

Step 5036: When the link states indicate that all second service nodes are candidate root causes, the management node obtains, based on the networking information, the first core network device to which the first aggregation network device is connected and the third service nodes connected to the other aggregation network devices attached to the first core network device, where the third service nodes are the service nodes connected to the first core network device other than the target service node.

The probability that all service nodes connected to the same aggregation network device fail at the same time is low. Therefore, when the link states indicate that all second service nodes connected to the first aggregation network device are candidate root causes, it can be determined that the fault lies in the first aggregation network device or in a network device that provides network connectivity to it. The management node can then obtain, based on the networking information, the first core network device to which the first aggregation network device is connected and the third service nodes connected to the other aggregation network devices attached to the first core network device, and use them to determine the root cause of the fault.

Step 5037: When the link states indicate that the third service nodes include a normal node, the management node determines that the first aggregation network device is the root cause of the fault.

After the first core network device and all third service nodes connected to it are determined, the link states of all third service nodes can be obtained, and whether the link of each third service node is normal can be determined from its link state. If the link of a third service node is normal, that service node can transmit data over the link and is a normal node. If the first core network device were the root cause, none of the third service nodes connected to it could work normally. Therefore, when the link states indicate that at least one of the third service nodes is a normal node, the first core network device can be determined to be in a normal state, which means the problem lies in the first aggregation network device itself. The first aggregation network device is then determined to be the root cause of the fault, and the fault in the service cluster is referred to as an aggregation network device fault.

As shown in FIG. 9, in step 5037 the current candidate root cause is the first aggregation network device (the aggregation switch 40 in FIG. 9), so the management node can display, based on the topology diagram, at least the first aggregation network device, the access network devices connected to it (the access switches 30 in FIG. 9), the service nodes, the first core network device (the core switch 50 in FIG. 9), and all third service nodes connected through the first core network device. In FIG. 9, dots filled in black are service nodes determined to be candidate root causes, and dots filled in white are service nodes not determined to be candidate root causes. FIG. 9 shows that the third service nodes connected to the first core network device include normal nodes, so the first aggregation network device can be determined to be the root cause of the fault.

Step 5038: When the link states indicate that all third service nodes are candidate root causes, the management node determines that the first core network device is the root cause of the fault.

The probability that all service nodes connected to the same core network device fail at the same time is low. Therefore, when the link states indicate that all third service nodes connected to the first core network device are candidate root causes, it can be determined that the first core network device has failed, that is, the first core network device is the root cause of the fault. The fault in the service cluster is then referred to as a core network device fault. The scope of impact increases in the order of service node fault, access network device fault, aggregation network device fault, and core network device fault.
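By way of illustration, the layer-by-layer escalation of steps 5032 to 5038 may be sketched as a single loop. The topology representation (one uplink per device), the function names, and the handling of an empty peer set (treated as "no normal peer", so the analysis escalates) are assumptions of this sketch rather than details given in the present application. At each layer, the peers examined are the service nodes under the parent device that are not already under the current suspect, which corresponds to the first, second, and third service nodes described above.

```python
def nodes_under(device, uplink, service_nodes):
    """Return the service nodes served by `device` (a service node serves itself)."""
    if device in service_nodes:
        return {device}
    found = set()
    for node in service_nodes:
        current = uplink.get(node)
        while current is not None:
            if current == device:
                found.add(node)
                break
            current = uplink.get(current)
    return found


def locate_root_cause(target, uplink, service_nodes, candidates):
    """Walk up from a candidate service node until a normal peer is found.

    `candidates` is the set of service nodes marked as candidate root causes.
    """
    suspect = target
    while True:
        parent = uplink.get(suspect)
        if parent is None:
            # Top layer reached: every peer below was a candidate (step 5038).
            return suspect
        peers = nodes_under(parent, uplink, service_nodes) - nodes_under(
            suspect, uplink, service_nodes
        )
        if any(peer not in candidates for peer in peers):
            # The parent serves at least one normal node, so the parent is
            # healthy and the current suspect is the root cause.
            return suspect
        suspect = parent  # all peers are candidates: escalate one layer
```

For a topology in which `core-1` serves `agg-1` and `agg-2`, `agg-1` serves `acc-1` and `acc-2`, and each access device serves two service nodes, marking only `n1` as a candidate yields `n1`, marking both nodes under `acc-1` yields `acc-1`, and marking every node under `agg-1` yields `agg-1`.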

It should be noted that steps 5031 to 5038 above are an example of how aggregation analysis based on the link status and networking information can determine the root cause of the fault, and do not limit how it is implemented. The implementation may be adapted to different application requirements. For example, after a device at one network level is determined to be a candidate root cause, the other service nodes connected to the upper-level network device to which that candidate is connected may be obtained. Then, from all of those other service nodes that are in a normal state, a first specified number and a second specified number of other service nodes are selected, and a third specified number of service nodes are selected from the service nodes of the candidate root cause. The first specified number of other service nodes serve as destination nodes, and the second specified number of other service nodes and the third specified number of service nodes serve as source nodes. The link status between each destination node and each source node is then determined, and whether the candidate is the root cause of the fault is decided according to the logic of steps 5031 to 5038. Selecting only a specified number of service nodes and other service nodes, and determining the root cause from the link status between them, reduces the amount of data the aggregation analysis has to process and further improves the efficiency of locating the root cause.
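The sampling variant above can be sketched as a function that builds the list of source and destination pairs to probe. This is only an illustration: `build_probe_plan` and its parameters are hypothetical names, random selection is one possible selection policy (the application does not prescribe one), and the sketch assumes that destination nodes and normal source nodes are chosen without overlap.

```python
import random


def build_probe_plan(healthy_peers, candidate_nodes,
                     n_dst, n_healthy_src, n_candidate_src, seed=None):
    """Pick destinations and sources, then pair every source with every destination."""
    rng = random.Random(seed)
    healthy = sorted(healthy_peers)
    if len(healthy) < n_dst + n_healthy_src:
        raise ValueError("not enough normal peers for the requested sample sizes")
    # Draw destinations and normal sources in one sample so they never overlap.
    picked = rng.sample(healthy, n_dst + n_healthy_src)
    destinations = picked[:n_dst]
    healthy_sources = picked[n_dst:]
    # Cap the third number at the count of nodes under the candidate.
    candidates = sorted(candidate_nodes)
    candidate_sources = rng.sample(candidates, min(n_candidate_src, len(candidates)))
    pairs = [(src, dst)
             for dst in destinations
             for src in healthy_sources + candidate_sources]
    return destinations, healthy_sources, candidate_sources, pairs


# One destination, one normal source, two sources under the candidate.
_, _, _, pairs = build_probe_plan({"h1", "h2", "h3"}, {"c1", "c2", "c3"}, 1, 1, 2, seed=0)
print(len(pairs))  # prints 3
```

With these numbers only three links are checked instead of every link between every pair of nodes, which is the reduction in processing that the paragraph describes.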

When the candidate root cause is a service node, the service node of the candidate root cause is that service node itself. When the candidate root cause is a network device that provides network access to service nodes, the service nodes of the candidate root cause are the service nodes connected to that network device. The levels of a service node and of the network devices that provide its network rise as their logical distance to the service node increases. For example, because the logical distance to the service node increases in the order service node, access network device, aggregation network device, core network device, their levels rise in the same order.
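The level ordering described above can be modeled with an ordered enumeration. The names `Level` and `next_level_up` are hypothetical and only illustrate the ordering; the numeric values are arbitrary apart from their order.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels ordered by logical distance to the service node."""
    SERVICE_NODE = 0
    ACCESS = 1
    AGGREGATION = 2
    CORE = 3


def next_level_up(level):
    """Return the level directly above, or None when already at the core level."""
    return Level(level + 1) if level < Level.CORE else None


print(next_level_up(Level.ACCESS).name)  # prints AGGREGATION
```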

For example, as shown in FIG. 10, assume that an aggregation network device (aggregation switch 40 in FIG. 10) is the candidate root cause. The other service nodes connected to the core network device (core switch 50 in FIG. 10) to which that aggregation network device is connected can be obtained. Then, from all of those other service nodes that are in a normal state, a first specified number (1 in FIG. 10) and a second specified number (1 in FIG. 10) of other service nodes are selected, and a third specified number (2 in FIG. 10) of service nodes are selected from the service nodes connected to the aggregation network device. The first specified number of other service nodes serve as destination nodes, and the second specified number of other service nodes and the third specified number of service nodes serve as source nodes. The link status between each destination node and each source node is then determined (on the dashed arrows in FIG. 10, Y indicates a normal link status and N an abnormal one), and whether the aggregation network device is the root cause of the fault is decided according to the logic of steps 5031 to 5038.
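One plausible way to turn the sampled Y/N link statuses into a decision is sketched below. The application only says that the decision follows the logic of steps 5031 to 5038, so the three outcomes and the function name `judge_candidate` are assumptions made for illustration, and the link statuses in the example are made-up values rather than values read from FIG. 10.

```python
def judge_candidate(link_ok, destinations, healthy_sources, candidate_sources):
    """Classify a candidate from sampled link statuses.

    link_ok maps (source, destination) to True for a normal link (Y)
    and False for an abnormal link (N).
    """
    reference_ok = all(link_ok[(s, d)] for s in healthy_sources for d in destinations)
    if not reference_ok:
        # Even normal peers cannot reach the destinations: look one level higher.
        return "escalate"
    all_broken = all(not link_ok[(s, d)] for s in candidate_sources for d in destinations)
    return "root_cause" if all_broken else "not_confirmed"


# Illustrative statuses: the normal source reaches d1, both candidate sources do not.
status = {("h1", "d1"): True, ("c1", "d1"): False, ("c2", "d1"): False}
print(judge_candidate(status, ["d1"], ["h1"], ["c1", "c2"]))  # prints root_cause
```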

In summary, the method for locating the root cause of a fault provided in the embodiments of this application performs aggregation analysis based on the link status and networking information, so the root cause can be located automatically across the whole cluster and found quickly. This improves the efficiency of locating the root cause and reduces the probability that the fault affects services.

It should be noted that the order of the steps of the method provided in the embodiments of this application may be adjusted as appropriate, and steps may be added or removed as circumstances require. Any variation that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application and is therefore not described further.

The method for locating the root cause of a fault is described above. Corresponding to that method, an embodiment of this application further provides a device for locating the root cause of a fault. FIG. 11 is a schematic structural diagram of such a device. Based on the modules shown in FIG. 11, the device can perform all or some of the operations shown in FIG. 5. It should be understood that the device may include additional modules beyond those shown, or omit some of the modules shown, which is not limited in the embodiments of this application. The device is applied to a management node that manages a service cluster; the service cluster includes a plurality of service nodes, and the service nodes are used to implement user services. As shown in FIG. 11, the fault root cause locating device 110 includes:

An acquisition module 1101, configured to acquire networking information of the plurality of service nodes.

The acquisition module 1101 is further configured to acquire, when the service cluster fails, the link status of the links between the plurality of service nodes.

A processing module 1102, configured to perform aggregation analysis based on the link status and the networking information to determine the root cause of the fault.

Optionally, the plurality of service nodes all access the network through access network devices located at the access layer. In this case, the processing module 1102 is specifically configured to: when the link status of the links between a target service node and other service nodes cannot be obtained, or when the link status from other service nodes indicates that the link to the target service node is broken, determine that the target service node is a candidate root cause of the fault; based on the networking information, obtain the first access network device to which the target service node is connected and the first service nodes connected to the first access network device, the first service nodes being the service nodes connected to the first access network device other than the target service node; and when the link status indicates that the first service nodes include a normal node, determine that the target service node is the root cause of the fault.

Optionally, the access network devices access the network through aggregation network devices located at the aggregation layer. In this case, the processing module 1102 is specifically configured to: when the link status indicates that all first service nodes are candidate root causes of the fault, obtain, based on the networking information, the first aggregation network device to which the first access network device is connected and the second service nodes connected to the other access network devices that are connected to the first aggregation network device, the second service nodes being the service nodes connected to the first aggregation network device other than the target service node; and when the link status indicates that the second service nodes include a normal node, determine that the first access network device is the root cause of the fault.

Optionally, the aggregation network devices access the network through core network devices located at the core layer. In this case, the processing module 1102 is specifically configured to: when the link status indicates that all second service nodes are candidate root causes of the fault, obtain, based on the networking information, the first core network device to which the first aggregation network device is connected and the third service nodes connected to the other aggregation network devices that are connected to the first core network device, the third service nodes being the service nodes connected to the first core network device other than the target service node; and when the link status indicates that the third service nodes include a normal node, determine that the first aggregation network device is the root cause of the fault.
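The first condition handled by the processing module 1102, deciding which service nodes are candidate root causes, can be sketched as follows. The function name `find_candidates` and the shape of `reports` are hypothetical, and the sketch assumes that a single report of a broken link is enough to mark the peer as a candidate; a real system might require agreement from several reporters.

```python
def find_candidates(all_nodes, reports):
    """Return the service nodes that are candidate root causes.

    reports maps a reporting node to {peer: True if the link is normal, False if broken}.
    A node is a candidate when no link status could be obtained from it,
    or when another node reports the link to it as broken.
    """
    candidates = {node for node in all_nodes if node not in reports}
    for reporter, peers in reports.items():
        for peer, ok in peers.items():
            if not ok:
                candidates.add(peer)
    return candidates


reports = {
    "a": {"b": True, "c": False, "d": True},
    "b": {"a": True, "c": False, "d": True},
    "d": {"a": True, "b": True, "c": False},
}
print(find_candidates(["a", "b", "c", "d"], reports))  # prints {'c'}
```

Node `c` delivers no report and is reported as unreachable by every other node, so it is the only candidate in this example.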

Optionally, the acquisition module 1101 is specifically configured to receive, from each service node, the link status of the links between that service node and other service nodes.

Optionally, the acquisition module 1101 is specifically configured to: determine a plurality of service nodes under test among the plurality of service nodes; provide information about the service nodes under test to each service node, so that each service node obtains the link status of the link between itself and each service node under test; and receive, from each service node, the link status of the link between that service node and each service node under test.
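The collection flow of the acquisition module 1101 can be sketched as follows. Everything here is a hypothetical stand-in: `collect_link_status` is an illustrative name, and `probe(src, dst)` represents whatever mechanism a service node uses to test a link, returning True for a normal link and False for a broken one, and raising `ConnectionError` when the management node cannot obtain any status from `src`.

```python
def collect_link_status(service_nodes, nodes_under_test, probe):
    """Ask every service node to probe each node under test and gather the replies."""
    reports = {}
    for src in service_nodes:
        try:
            reports[src] = {dst: probe(src, dst)
                            for dst in nodes_under_test if dst != src}
        except ConnectionError:
            # No link status can be obtained from src, so it contributes no report.
            continue
    return reports


def fake_probe(src, dst, down=frozenset({"c"})):
    """Simulated probe for the example: node c is down."""
    if src in down:
        raise ConnectionError(src)
    return dst not in down


result = collect_link_status(["a", "b", "c", "d"], ["a", "c"], fake_probe)
print(result["b"])  # prints {'a': True, 'c': False}
```

Probing only the nodes under test keeps the number of checks at roughly the number of service nodes multiplied by the number of nodes under test, rather than one check per pair of service nodes.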

In summary, the device for locating the root cause of a fault provided in the embodiments of this application performs aggregation analysis based on the link status and networking information, so the root cause can be located automatically across the whole cluster and found quickly. This improves the efficiency of locating the root cause and reduces the probability that the fault affects services.

Both the acquisition module 1101 and the processing module 1102 may be implemented in software or in hardware. The implementation of the acquisition module 1101 is described next as an example; the processing module 1102 may be implemented in the same way.

As an example of a module that is a software functional unit, the acquisition module 1101 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container, and there may be one or more such computing instances. For example, the acquisition module 1101 may include code running on multiple hosts, virtual machines, or containers. The multiple hosts, virtual machines, or containers that run the code may be distributed in the same region or in different regions. They may likewise be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or several geographically close data centers. A region usually includes multiple AZs.

Similarly, the multiple hosts, virtual machines, or containers that run the code may be distributed in the same virtual private cloud (VPC) or in multiple VPCs. A VPC is usually set up within one region. For communication between two VPCs in the same region, and for cross-region communication between VPCs in different regions, a communication gateway needs to be set up in each VPC, and the VPCs are interconnected through the communication gateways.

As an example of a module that is a hardware functional unit, the acquisition module 1101 may include at least one computing device, such as a server. Alternatively, the acquisition module 1101 may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The multiple computing devices included in the acquisition module 1101 may be distributed in the same region or in different regions, and in the same AZ or in different AZs. Likewise, they may be distributed in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

It should be noted that, in other embodiments, either of the acquisition module 1101 and the processing module 1102 may be used to perform any step of the method for locating the root cause of a fault. The steps that each module is responsible for may be assigned as needed; the two modules implement different steps of the method so that together they provide all functions of the fault root cause locating device.

A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the device and modules described above can be found in the corresponding content of the foregoing method embodiments and are not repeated here.

An embodiment of this application provides a computing device, which is used to implement some or all functions of the method for locating the root cause of a fault provided in the embodiments of this application. FIG. 12 is a schematic structural diagram of a computing device provided by an embodiment of this application. As shown in FIG. 12, the computing device 1200 includes a processor 1201, a memory 1202, a communication interface 1203, and a bus 1204. The processor 1201, the memory 1202, and the communication interface 1203 are communicatively connected to one another through the bus 1204.

The processor 1201 may include a general-purpose processor and/or a dedicated hardware chip. The general-purpose processor may include a central processing unit (CPU), a microprocessor, or a graphics processing unit (GPU). The CPU is, for example, a single-core processor (single-CPU) or a multi-core processor (multi-CPU). The dedicated hardware chip is a hardware module for high-performance processing and includes at least one of a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a network processor (NP). The processor 1201 may also be an integrated circuit chip with signal processing capability. In implementation, some or all functions of the method for locating the root cause of a fault may be completed by integrated logic circuits of hardware in the processor 1201 or by instructions in the form of software.

The memory 1202 is configured to store a computer program, which includes an operating system 1202a and executable code (that is, program instructions) 1202b. The memory 1202 is, for example, a read-only memory or another type of static storage device that can store static information and instructions; a random access memory or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory; a compact disc read-only memory or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, and Blu-ray discs); a magnetic disk storage medium or another magnetic storage device; or any other medium that can carry or store desired executable code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. For example, the memory 1202 is used to store egress port queues. The memory 1202 may exist independently and be connected to the processor 1201 through the bus 1204, or the memory 1202 and the processor 1201 may be integrated together. The memory 1202 may store executable code; when the executable code stored in the memory 1202 is executed by the processor 1201, the processor 1201 performs some or all functions of the method for locating the root cause of a fault provided in the embodiments of this application. For how the processor 1201 performs this process, refer to the relevant descriptions in the foregoing embodiments. The memory 1202 may also include the operating system and other software modules and data required for running processes.

The communication interface 1203 uses a transceiver module, such as but not limited to a transceiver, to communicate with other devices or a communication network. For example, the communication interface 1203 may be any one or any combination of devices with a network access function, such as a network interface (for example, an Ethernet interface) or a wireless network card.

The bus 1204 is any type of communication bus, such as a system bus, that interconnects the internal components of the computing device (for example, the memory 1202, the processor 1201, and the communication interface 1203). The embodiments of this application use interconnection through the bus 1204 as an example. Optionally, the components inside the computing device 1200 may be communicatively connected in ways other than the bus 1204, for example through internal logical interfaces.

It should be noted that the components above may each be placed on a separate chip, or at least some or all of them may be placed on the same chip. Whether the components are placed on different chips or integrated on one or more chips usually depends on product design requirements. The embodiments of this application do not limit the specific implementation form of these components. In addition, the descriptions of the processes corresponding to the accompanying drawings each have their own emphasis; for parts not described in detail in one process, refer to the related descriptions of other processes.

The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product that provides the program development platform includes one or more computer instructions; when these computer program instructions are loaded and executed on a computing device, the functions of the method for locating the root cause of a fault provided in the embodiments of this application are implemented in whole or in part.

In addition, the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium stores computer program instructions that provide a program development platform.

An embodiment of this application further provides a computing device cluster, which includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a notebook computer, or a smartphone.

Optionally, for the structure of the at least one computing device included in the computing device cluster, refer to the computing device 1200 shown in FIG. 12. The memories 1202 of one or more computing devices 1200 in the cluster may store the same instructions for performing the method for locating the root cause of a fault.

In some possible implementations, the memories 1202 of one or more computing devices 1200 in the cluster may each store some of the instructions for performing the method for locating the root cause of a fault. In other words, a combination of one or more computing devices 1200 may jointly execute the instructions for performing the method.

It should be noted that the memories 1202 of different computing devices 1200 in the cluster may store different instructions, each used to perform some of the functions of the fault root cause locating device. That is, the instructions stored in the memories 1202 of different computing devices 1200 may implement the functions of one or both of the acquisition module and the processing module.

In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network, which may be a wide area network, a local area network, or the like. FIG. 13 shows one possible implementation. As shown in FIG. 13, two computing devices 1300A and 1300B are connected through a network; specifically, each connects to the network through its communication interface. In this type of implementation, the computing devices 1300A and 1300B each include a bus 1302, a processor 1304, a memory 1306, and a communication interface 1308. The memory 1306 in the computing device 1300A stores instructions for performing the functions of the processing module, and the memory 1306 in the computing device 1300B stores instructions for performing the functions of the acquisition module.

It should be understood that the functions of the computing device 1300A shown in FIG. 13 may also be performed by multiple computing devices 1300, and likewise for the computing device 1300B. How the modules that implement the method for locating the root cause of a fault are deployed across computing devices may also be adjusted according to application requirements.

An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium is a non-volatile computer-readable storage medium that includes program instructions; when the program instructions run on a computing device, the computing device implements the method for locating the root cause of a fault provided in the embodiments of this application.

An embodiment of this application further provides a computer program product containing instructions. When the computer program product runs on a computer, the computer implements the method for locating the root cause of a fault provided in the embodiments of this application.

A person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments may be completed by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

It should be noted that the information (including but not limited to user equipment information and user personal information), data (including but not limited to data used for analysis, stored data, and displayed data), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the networking information and link status involved in this application are obtained with full authorization.

In the embodiments of this application, the terms "first", "second", and "third" are used for description only and shall not be understood as indicating or implying relative importance. The term "at least one" means one or more, and the term "a plurality of" means two or more, unless otherwise expressly limited.

The term "and/or" in the present application merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

The above are merely optional embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the concept and principles of the present application shall fall within the protection scope of the present application.

Claims

1. A method for locating the root cause of a fault, wherein the method is applied to a management node that manages a service cluster, the service cluster includes a plurality of service nodes, and the service nodes are used to implement user services, the method comprising:

acquiring networking information of the plurality of service nodes;

when the service cluster fails, acquiring the link state of the links between the plurality of service nodes; and

performing aggregation analysis based on the link state and the networking information to determine the root cause of the fault.

2. The method according to claim 1, wherein the plurality of service nodes access the network through an access network device located at the access layer, and the performing aggregation analysis based on the link state and the networking information to determine the root cause of the fault comprises:

when the link state of the link between a target service node and other service nodes cannot be obtained, or when the link state from other service nodes indicates that they are disconnected from the target service node, determining that the target service node is a candidate root cause of the fault;

based on the networking information, obtaining a first access network device connected to the target service node and a first service node connected to the first access network device, wherein the first service node is a service node, other than the target service node, connected to the first access network device; and

when the link state indicates that the first service node includes a normal node, determining that the target service node is the root cause of the fault.

3. The method according to claim 2, wherein the access network device accesses the network through an aggregation network device located at the aggregation layer, and the performing aggregation analysis based on the link state and the networking information to determine the root cause of the fault comprises:

when the link state indicates that the first service node is a candidate root cause of the fault, obtaining, based on the networking information, a first aggregation network device connected to the first access network device, and a second service node connected to other access network devices connected to the first aggregation network device, wherein the second service node is a service node, other than the target service node, connected to the first aggregation network device; and

when the link state indicates that the second service node includes a normal node, determining that the first access network device is the root cause of the fault.

4. The method according to claim 3, wherein the aggregation network device accesses the network through a core network device located at the core layer, and the performing aggregation analysis based on the link state and the networking information to determine the root cause of the fault comprises:

when the link state indicates that the second service node is a candidate root cause of the fault, obtaining, based on the networking information, a first core network device connected to the first aggregation network device, and a third service node connected to other aggregation network devices connected to the first core network device, wherein the third service node is a service node, other than the target service node, connected to the first core network device; and

when the link state indicates that the third service node includes a normal node, determining that the first aggregation network device is the root cause of the fault.

5. The method according to claim 4, wherein the performing aggregation analysis based on the link state and the networking information to determine the root cause of the fault further comprises:

when the link state indicates that all the third service nodes are candidate root causes of the fault, determining that the first core network device is the root cause of the fault.
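The layer-by-layer escalation described in claims 2 to 5 can be illustrated with a minimal sketch. This is not the applicant's implementation: the `Topology` class, its three child-to-parent maps, the `locate_root_cause` function, and the returned labels are all hypothetical names chosen for illustration. The sketch also assumes that the set of candidate root cause nodes has already been derived from the link state, and that a layer with no peer nodes to compare against is escalated to the next layer, which the claims do not specify.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Topology:
    """Hypothetical networking information: each map points one layer up."""
    node_to_access: dict[str, str]  # service node -> access network device
    access_to_agg: dict[str, str]   # access device -> aggregation device
    agg_to_core: dict[str, str]     # aggregation device -> core device

    def nodes_under_access(self, access: str) -> set[str]:
        return {n for n, a in self.node_to_access.items() if a == access}

    def nodes_under_agg(self, agg: str) -> set[str]:
        return {n for n, a in self.node_to_access.items()
                if self.access_to_agg[a] == agg}

    def nodes_under_core(self, core: str) -> set[str]:
        return {n for n, a in self.node_to_access.items()
                if self.agg_to_core[self.access_to_agg[a]] == core}


def has_normal(nodes: set[str], candidates: set[str]) -> bool:
    """True if at least one node in the group is not a candidate root cause."""
    return any(n not in candidates for n in nodes)


def locate_root_cause(target: str, topo: Topology, candidates: set[str]):
    """Return a (kind, name) pair, or None if the target is not a candidate."""
    if target not in candidates:
        return None

    # Claim 2: a normal peer on the same access device blames the target.
    access = topo.node_to_access[target]
    under_access = topo.nodes_under_access(access)
    if has_normal(under_access - {target}, candidates):
        return ("service_node", target)

    # Claim 3: a normal node behind another access device of the same
    # aggregation device blames the first access device.
    agg = topo.access_to_agg[access]
    under_agg = topo.nodes_under_agg(agg)
    if has_normal(under_agg - under_access, candidates):
        return ("access_device", access)

    # Claim 4: a normal node behind another aggregation device of the same
    # core device blames the first aggregation device.
    core = topo.agg_to_core[agg]
    under_core = topo.nodes_under_core(core)
    if has_normal(under_core - under_agg, candidates):
        return ("aggregation_device", agg)

    # Claim 5: every compared node is a candidate, so blame the core device.
    return ("core_device", core)
```

For example, with two service nodes behind one access device, marking only one of them as a candidate yields that service node as the root cause, while marking both (with a normal node behind a sibling access device) yields the shared access device.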

6. The method according to any one of claims 1 to 5, wherein the acquiring the link state of the links between the plurality of service nodes comprises:

receiving, from each service node, the link state of the links between that service node and other service nodes.

7. The method according to claim 6, wherein, before the acquiring the link state of the links between the plurality of service nodes, the method further comprises:

determining a plurality of service nodes to be tested among the plurality of service nodes; and

providing each service node with information about the plurality of service nodes to be tested, so that each service node obtains the link state of the link between that service node and each service node to be tested;

and wherein the receiving, from each service node, the link state of the links between that service node and other service nodes comprises:

receiving, from each service node, the link state of the link between that service node and each service node to be tested.

8. The method according to claim 7, wherein the network range of the plurality of service nodes to be tested covers the network range of the service cluster.
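The collection flow of claims 6 to 8, combined with the candidate condition of claim 2, can be sketched as follows. The function names `select_test_nodes` and `find_candidates`, the report format (reporter mapped to a dictionary of probed node and connectivity flag), and both selection rules are assumptions made for illustration only. In particular, picking one service node per access network device is just one possible way to make the nodes to be tested cover the network range of the cluster, and treating a probed node as a candidate only when every other reporter sees it as disconnected is one reading of the claim language.

```python
def select_test_nodes(node_to_access):
    """Pick one service node per access device (assumed coverage rule)."""
    chosen = {}
    for node in sorted(node_to_access):  # sorted for a deterministic choice
        chosen.setdefault(node_to_access[node], node)
    return sorted(chosen.values())


def find_candidates(all_nodes, reports):
    """Derive candidate root cause nodes from the received reports.

    reports maps a reporting node to {probed node: True if connected}.
    """
    candidates = set()
    for node in all_nodes:
        # No report received: the node's link state cannot be obtained.
        if node not in reports:
            candidates.add(node)
            continue
        # Verdicts about this node from every other reporter that probed it.
        verdicts = [probed[node] for reporter, probed in reports.items()
                    if reporter != node and node in probed]
        # Assumed rule: all other reporters see the node as disconnected.
        if verdicts and not any(verdicts):
            candidates.add(node)
    return candidates
```

The resulting candidate set is the kind of input that a subsequent layer-by-layer analysis over the networking information would consume.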

9. The method according to any one of claims 1 to 8, wherein the link state is reflected by one or more of the following: the connectivity status of the link and the transmission delay of the link.
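As a minimal sketch of how a link state carrying connectivity status and transmission delay might be represented and evaluated, consider the following. The `LinkState` type, the `is_normal` function, and the 50 ms default threshold are hypothetical; the application does not specify a data format or a delay threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkState:
    """Hypothetical link state record reported by a service node."""
    connected: bool                   # connectivity status of the link
    delay_ms: Optional[float] = None  # transmission delay, if measured


def is_normal(state: LinkState, max_delay_ms: float = 50.0) -> bool:
    """Treat a link as normal if it is connected and not too slow.

    The threshold is an illustrative assumption, not a value from the source.
    """
    if not state.connected:
        return False
    if state.delay_ms is None:
        return True  # only connectivity was reported
    return state.delay_ms <= max_delay_ms
```

A link reported as connected with a measured delay above the threshold is treated here the same way as a disconnected link, which is one possible design choice when both indicators are available.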

10. A device for locating the root cause of a fault, wherein the device is applied to a management node that manages a service cluster, the service cluster includes a plurality of service nodes, and the service nodes are used to implement user services, the device comprising:

an acquisition module, configured to acquire networking information of the plurality of service nodes;

the acquisition module being further configured to acquire the link state of the links between the plurality of service nodes when the service cluster fails; and

a processing module, configured to perform aggregation analysis based on the link state and the networking information to determine the root cause of the fault.

11. The device according to claim 10, wherein the plurality of service nodes access the network through an access network device located at the access layer, and the processing module is specifically used for:

12. The device according to claim 11, wherein the access network device accesses the network through an aggregation network device located at the aggregation layer, and the processing module is specifically used for:

13. The device according to claim 12, wherein the aggregation network device accesses the network through a core network device located at the core layer, and the processing module is specifically used for:

14. The device according to claim 13, wherein the processing module is specifically used for:

15. The device according to any one of claims 10 to 14, wherein the acquisition module is specifically used for:

16. The device according to claim 15, wherein the acquisition module is specifically used for:

17. The device according to claim 16, wherein the network range of the plurality of service nodes to be tested covers the network range of the service cluster.

18. The device according to any one of claims 10 to 17, wherein the link state is reflected by one or more of the following: the connectivity status of the link and the transmission delay of the link.

19. A computing device cluster, characterized in that it includes at least one computing device, each computing device includes a processor and a memory, and the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method according to any one of claims 1 to 9.

20. A computer program product containing instructions, wherein when the instructions are executed by a computing device cluster, the computing device cluster executes the method according to any one of claims 1 to 9.

21. A computer-readable storage medium, characterized by comprising computer program instructions, wherein when the computer program instructions are executed by a computing device cluster, the computing device cluster executes the method according to any one of claims 1 to 9.