HK40036252B - Method, device and system for detecting fault, apparatus and storage medium - Google Patents
- Publication number
- HK40036252B, HK42021025352.2A
- Authority
- HK
- Hong Kong
- Prior art keywords
- request packet
- gateway server
- server
- gateway
- packet
Description
Technical Field
This application relates to the field of communication technology, and in particular to a fault detection method, apparatus, system, device, and storage medium.
Background
As the first checkpoint through which external service requests enter an internal service network, the gateway plays a crucial role in keeping Internet services continuously accessible. With the rapid development and widespread adoption of Internet services, the service traffic passing through a gateway is now typically enormous (e.g., exceeding 10T). Under these conditions, even a minor, transient fluctuation in quality can have an immeasurable impact on Internet services.
Because every external service request must pass through the gateway to reach an internal service server, the gateway is among the most basic pieces of network infrastructure and typically has the following characteristics: 1) High service sensitivity: for sensitive Internet services, even a fault lasting 1-2 minutes is unacceptable, so faults must be detected and handled within seconds; 2) Difficult fault localization: a problem in any part of the gateway system can affect the entire network link, but because the network environment and the gateway system architecture are complex, it is often hard to pinpoint the faulty part once an impact on the network link has been detected.
How to locate faults in a gateway system quickly and accurately has therefore become an urgent problem.
Summary of the Invention
Embodiments of this application provide a fault detection method, apparatus, system, device, and storage medium that can quickly and accurately detect whether a gateway server is faulty.
In view of this, a first aspect of this application provides a fault detection method, the method comprising:
receiving, within a fault detection period, tracking information reported by each network device through which a probe request packet passes; the tracking information is record information generated by the network device when it operates on the probe request packet; the network device includes at least one of a client, a gateway server, and a service server;
taking the tracking information that contains the same request packet identifier as the tracking information of the probe request packet corresponding to that request packet identifier; and
performing gateway fault detection on the network device according to the tracking information of the probe request packet.
A second aspect of this application provides a fault detection apparatus, the apparatus comprising:
an information acquisition module, configured to receive, within a fault detection period, tracking information reported by each network device through which a probe request packet passes; the tracking information is record information generated by the network device when it operates on the probe request packet; the network device includes at least one of a client, a gateway server, and a service server;
the information acquisition module being further configured to take the tracking information that contains the same request packet identifier as the tracking information of the probe request packet corresponding to that request packet identifier; and
a fault detection module, configured to perform gateway fault detection on the network device according to the tracking information of the probe request packet.
A third aspect of this application provides a fault detection system, the system comprising a client, a gateway server, and a fault detection server;
the client is configured to generate tracking information according to its packet sending and packet receiving operations on a probe request packet, and to report the tracking information to the fault detection server;
the gateway server is configured to generate tracking information according to its packet sending and packet receiving operations on the probe request packet, and to report the tracking information to the fault detection server;
the fault detection server is configured to execute the fault detection method described in the first aspect above.
A fourth aspect of this application provides a device, the device comprising a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to perform, according to the computer program, the steps of the fault detection method described in the first aspect above.
A fifth aspect of this application provides a computer-readable storage medium configured to store a computer program, the computer program being used to perform the steps of the fault detection method described in the first aspect above.
A sixth aspect of this application provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps of the fault detection method described in the first aspect above.
As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
The embodiments of this application provide a fault detection method that borrows from the medical practice of locating the cause of a disease by staining with isotopes or fluorescent agents. The method stains and records tracking points along the full link for probe request packets that pass through network devices on the Internet. That is, at least one of a client, a gateway server, and a service server generates tracking information for a probe request packet according to the operations it performs on that packet. Gateway fault detection is then performed on the network device based on the tracking information of the probe request packets obtained within a fault detection period, so as to accurately detect whether the network device is faulty and, when a fault is detected, to locate the fault position and the fault cause. In this way, gateway fault detection and localization in a complex network environment are achieved based on the transmission link path of the probe request packets.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the working architecture of the fault detection system provided in an embodiment of this application;
Figure 2 is a schematic diagram of the principle of recording tracking information provided in an embodiment of this application;
Figure 3 is a schematic flowchart of the fault detection method provided in an embodiment of this application;
Figure 4 is a schematic diagram of an exemplary network architecture provided in an embodiment of this application;
Figure 5 is a schematic diagram of the forwarding flow of the main forwarding program in the gateway server provided in an embodiment of this application;
Figure 6 is a schematic diagram of the principle of controlling communication between the client and the service server provided in an embodiment of this application;
Figure 7 is a schematic diagram of the working principle of the staining system provided in an embodiment of this application;
Figure 8 is a schematic diagram of the principle of the log forwarding program and the log analysis program provided in an embodiment of this application;
Figure 9 is a working performance curve of a gateway cluster provided in an embodiment of this application;
Figure 10 is a working performance curve of another gateway cluster provided in an embodiment of this application;
Figure 11 is a schematic structural diagram of a first fault detection apparatus provided in an embodiment of this application;
Figure 12 is a schematic structural diagram of a second fault detection apparatus provided in an embodiment of this application;
Figure 13 is a schematic structural diagram of a third fault detection apparatus provided in an embodiment of this application;
Figure 14 is a schematic structural diagram of a fourth fault detection apparatus provided in an embodiment of this application;
Figure 15 is a schematic structural diagram of the server provided in an embodiment of this application.
Detailed Description
To enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such a process, method, product, or device.
To address the problem of how to detect gateway server faults quickly and accurately, the embodiments of this application provide a fault detection method that borrows from the medical practice of locating the cause of a disease by staining with isotopes or fluorescent agents. The method stains and records tracking points along the full link for probe request packets that pass through network devices on the Internet. That is, at least one of a client, a gateway server, and a service server generates tracking information for a probe request packet according to the operations it performs on that packet. Gateway fault detection is then performed on the network device based on the tracking information of the probe request packets obtained within the fault detection period, so as to accurately detect whether the network device is faulty and, when a fault is detected, to locate the fault position and the fault cause. In this way, gateway fault detection and localization in a complex network environment are achieved based on the transmission link path of the probe request packets.
It should be noted that the fault detection method provided in the embodiments of this application is generally applied to a server with data processing capability. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; this application imposes no limitation in this respect.
To facilitate understanding of the fault detection method provided in the embodiments of this application, the fault detection system to which the method is applied is first introduced below in conjunction with the working architecture of the gateway server.
Refer to Figure 1, which is a schematic diagram of the working architecture of the fault detection system provided in an embodiment of this application. As shown in Figure 1, the fault detection system includes a client 110, a gateway server 120, and a fault detection server 130. The system is typically deployed in an Internet communication architecture that also includes a service server 140, an external network switch 150, and an internal network switch 160.
The client 110 is configured to generate a first request packet and send it to the service server 140 through the gateway server 120. In this process, the client 110 may generate tracking information according to its packet sending operation on the first request packet and upload the generated tracking information to the fault detection server 130. In practical applications, the client 110 may be a smartphone, tablet computer, laptop computer, desktop computer, smart speaker, smart watch, etc.; this application imposes no limitation on the client 110.
The gateway server 120 is configured to receive the first request packet sent by the client 110, process it accordingly, and send it to the service server 140. In this process, the gateway server 120 may generate tracking information according to its packet sending operation and/or packet receiving operation on the first request packet and upload the generated tracking information to the fault detection server 130. In practical applications, the gateway server 120 may be any gateway server in a gateway cluster; that is, the gateway server 120 may be deployed in a gateway cluster that includes multiple gateway servers 120.
The service server 140 is configured to receive the first request packet forwarded by the gateway server 120, generate a second request packet corresponding to the first request packet, and return the second request packet to the client 110 through the gateway server 120.
The gateway server 120 is further configured to receive the second request packet sent by the service server 140, process it accordingly, and send it to the client 110. In this process, the gateway server 120 may generate tracking information according to its packet sending and packet receiving operations on the second request packet and upload the generated tracking information to the fault detection server 130.
The client 110 is further configured to receive the second request packet sent by the gateway server 120, generate tracking information according to its packet receiving operation on the second request packet, and upload the generated tracking information to the fault detection server 130.
It should be noted that, in the technical solutions provided in the embodiments of this application, both the first request packet and the second request packet are probe request packets. After the service server 140 receives the first request packet sent by the client through the gateway server 120, it generates a second request packet for that first request packet and returns the second request packet to the client through the gateway server 120. In this process, the first request packet and the second request packet are in essence the same probe request packet; that is, they carry the same request packet identifier. For example, a first request packet and a second request packet that belong to the same probe request packet may carry the same four-tuple identifier composed of source IP address, source port, destination IP address, and destination port.
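The four-tuple identifier described above can be illustrated with a minimal Python sketch. It rests on an assumption that is not stated in the source: the reply is taken to carry the source and destination endpoints swapped, so the sketch sorts the two endpoints before building the identifier. The names `FourTuple` and `request_packet_id` and the example addresses are hypothetical.

```python
from typing import NamedTuple


class FourTuple(NamedTuple):
    """Source/destination endpoints of a packet (hypothetical helper type)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def request_packet_id(t: FourTuple) -> str:
    """Build an identifier that is the same for a request and its reply.

    Assumption: the reply swaps source and destination, so the two
    endpoints are sorted before being joined into one string.
    """
    low, high = sorted([(t.src_ip, t.src_port), (t.dst_ip, t.dst_port)])
    return f"{low[0]}:{low[1]}-{high[0]}:{high[1]}"


# First request packet (client to service) and second request packet (reply).
first = FourTuple("203.0.113.7", 40001, "198.51.100.10", 443)
second = FourTuple("198.51.100.10", 443, "203.0.113.7", 40001)
```

With this normalization both directions map to one identifier, so tracking information generated for the first and second request packets can be attributed to the same probe request packet.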
The fault detection server 130 is configured to execute the fault detection method provided in the embodiments of this application. Within the fault detection period, it receives the tracking information reported by each network device through which a probe request packet passes (including at least one of the client 110, the gateway server 120, and the service server 140), and takes the tracking information that contains the same request packet identifier as the tracking information of the probe request packet corresponding to that identifier. It then performs fault detection on the gateway servers 120 in the gateway cluster according to the obtained tracking information of the probe request packets. The specific fault detection process performed by the fault detection server 130 is described in detail in the method embodiments below.
The external network switch 150 is configured to forward request packets from the external network (such as the first request packet sent by the client 110) to the gateway server 120, and to forward request packets from the gateway server 120 to the client 110. The internal network switch 160 is configured to forward request packets from the internal network (such as the second request packet sent by the service server 140) to the gateway server 120, and to forward request packets from the gateway server 120 to the service server 140.
As shown in Figure 1, in practical applications, the first request packet sent by the client 110 first reaches the external network switch 150 after network transmission. The external network switch 150 may send the first request packet to one of the gateway servers 120 in the gateway cluster. After receiving the first request packet, the gateway server 120 selects a service server 140 according to a certain selection policy, encapsulates the first request packet using the Generic Routing Encapsulation (GRE) protocol, and sends the encapsulated packet to the internal network switch 160, which forwards it to the service server 140.
After receiving the first request packet, the service server 140 generates a second request packet corresponding to it and sends the second request packet to the internal network switch 160, which forwards it to one of the gateway servers 120 in the gateway cluster. After receiving the second request packet, the gateway server 120 removes the GRE header added by the GRE encapsulation and returns the decapsulated second request packet to the client 110 through the external network switch 150.
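The GRE encapsulation and decapsulation steps mentioned above can be sketched as follows. This is a simplified illustration, not the gateway's forwarding code: it handles only the 4-byte base GRE header (no checksum, version 0, IPv4 payload) and omits the outer delivery IP header that a real tunnel would add. The function names are hypothetical.

```python
import struct

GRE_PROTO_IPV4 = 0x0800  # EtherType value carried in the GRE Protocol Type field


def gre_encapsulate(inner_packet: bytes) -> bytes:
    """Prepend a 4-byte base GRE header (flags and version all zero)."""
    flags_and_version = 0x0000
    header = struct.pack("!HH", flags_and_version, GRE_PROTO_IPV4)
    return header + inner_packet


def gre_decapsulate(gre_packet: bytes) -> bytes:
    """Strip the base GRE header and return the inner packet."""
    if len(gre_packet) < 4:
        raise ValueError("packet shorter than a GRE header")
    flags_and_version, proto = struct.unpack("!HH", gre_packet[:4])
    # This sketch rejects optional fields (checksum, key, sequence number).
    if flags_and_version != 0 or proto != GRE_PROTO_IPV4:
        raise ValueError("unsupported GRE header in this sketch")
    return gre_packet[4:]
```

In this simplified view, the gateway server would call `gre_encapsulate` before forwarding the first request packet inward and `gre_decapsulate` on the second request packet before returning it to the client.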
During the request packet transmission process shown in Figure 1, the client 110 and the gateway server 120 may generate tracking information for the probe request packet. Specifically, as shown in Figure 2: after the client 110 sends the first request packet, it may record the first piece of tracking information TP1 for the probe request packet, which can be understood as recording the first point of a stained link; after the gateway server 120 receives the first request packet, it may record the second piece of tracking information TP2, that is, the second point of the stained link; after the gateway server 120 finishes processing the first request packet and sends it to the service server 140, it may record the third piece of tracking information TP3, that is, the third point of the stained link; after the gateway server 120 receives the second request packet returned by the service server 140 for the first request packet, it may record the fourth piece of tracking information TP4, that is, the fourth point of the stained link; after the gateway server 120 finishes processing the second request packet and sends it to the client 110, it may record the fifth piece of tracking information TP5, that is, the fifth point of the stained link; and after the client 110 receives the second request packet, it may record the sixth piece of tracking information TP6, that is, the sixth point of the stained link.
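The six tracking points of Figure 2 can be modeled with a small Python sketch. The type and field names (`TrackingPoint`, `TrackingRecord`, `request_id`, `device`) are hypothetical; the source specifies only which device records which point, not a record layout.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrackingPoint(IntEnum):
    TP1 = 1  # client sent the first request packet
    TP2 = 2  # gateway server received the first request packet
    TP3 = 3  # gateway server sent the first request packet to the service server
    TP4 = 4  # gateway server received the second request packet
    TP5 = 5  # gateway server sent the second request packet to the client
    TP6 = 6  # client received the second request packet


@dataclass(frozen=True)
class TrackingRecord:
    """One piece of tracking information (hypothetical layout)."""
    request_id: str        # request packet identifier shared by all six points
    point: TrackingPoint   # position on the stained link
    device: str            # device that recorded the point


def reporter_of(point: TrackingPoint) -> str:
    """Return which device records the given point in the Figure 2 flow."""
    if point in (TrackingPoint.TP1, TrackingPoint.TP6):
        return "client"
    return "gateway server"


# Example: the record the gateway server would report on receiving the request.
example = TrackingRecord("probe-0001", TrackingPoint.TP2,
                         reporter_of(TrackingPoint.TP2))
```

Using an ordered enumeration keeps the points comparable, so a consumer can sort the records of one probe request packet into link order.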
After the client 110 and the gateway server 120 have recorded the tracking information for the probe request packet, they send that tracking information to the fault detection server 130, so that the fault detection server 130 can detect, based on the tracking information, whether a gateway server 120 in the gateway cluster is faulty and, when a fault is detected, locate the cause of the fault in the gateway server 120.
It should be understood that, to accurately detect whether the gateway server 120 is faulty, the client 110 and the service server 140 usually need to exchange a large number of different probe request packets within the fault detection period. Accordingly, the fault detection server 130 can obtain the tracking information recorded by the client 110 and the gateway server 120 for these probe request packets and perform fault detection on the gateway server 120 based on the tracking information of each of them.
It should be noted that, in practical applications, besides the client 110 and the gateway server 120 jointly reporting the tracking information of probe request packets to the fault detection server 130, the gateway server 120 alone may report it, or the client 110, the gateway server 120, and the service server 140 may report it together. This application imposes no limitation on the source of the tracking information received by the fault detection server 130.
The fault detection method provided in this application is described in detail below through method embodiments.
Refer to Figure 3, which is a schematic flowchart of the fault detection method provided in an embodiment of this application. The following embodiment is described with the fault detection server as the executing entity. As shown in Figure 3, the fault detection method includes the following steps:
Step 301: Within a fault detection period, receive the tracking information reported by each network device through which a probe request packet passes; the tracking information is record information generated by the network device when it operates on the probe request packet; the network device includes at least one of a client, a gateway server, and a service server.
Step 302: Take the tracking information that contains the same request packet identifier as the tracking information of the probe request packet corresponding to that request packet identifier.
Because steps 301 and 302 are closely related, they are described together below as a single implementation process.
Within the fault detection period, the client may exchange a large number of probe request packets with the service server through the gateway server. While each probe request packet is in transit, the gateway server may generate tracking information for it according to its own packet sending operation and/or packet receiving operation on that packet, and then send the generated tracking information to the fault detection server.
To ensure that the fault detection server can learn a more complete transmission link path for each probe request packet, in practical applications the client that sends the probe request packets may also generate tracking information according to its own packet sending and packet receiving operations on them, and send the generated tracking information to the fault detection server.
It should be noted that tracking information usually contains a request packet identifier, and this identifier corresponds to the probe request packet from which the tracking information was generated; in other words, the request packet identifier contained in the tracking information is the identifier of that probe request packet. All pieces of tracking information for the same probe request packet should contain the same request packet identifier, while pieces of tracking information for different probe request packets should contain different request packet identifiers. This lets the fault detection server identify, from the request packet identifier, which probe request packet a piece of tracking information belongs to, collect all tracking information for a given probe request packet, and determine the transmission path of that packet on this basis.
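Grouping by request packet identifier, as in step 302, can be sketched in a few lines of Python. The record layout (dictionaries with `request_id`, `point`, and `device` keys) is an assumption made for illustration; the source does not define a wire format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_request_id(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Collect all tracking information that shares a request packet identifier."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        grouped[record["request_id"]].append(record)
    return dict(grouped)


# Tracking information as it might arrive, interleaved across two probes.
reported = [
    {"request_id": "probe-1", "point": "TP1", "device": "client"},
    {"request_id": "probe-2", "point": "TP1", "device": "client"},
    {"request_id": "probe-1", "point": "TP2", "device": "gateway server"},
]
by_probe = group_by_request_id(reported)
```

Each value in `by_probe` then holds the tracking information of one probe request packet, from which its transmission path can be read off.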
Because the pieces of tracking information for a probe request packet together usually identify the transmission link path of that packet, the fault detection server can detect whether a gateway server is faulty by analyzing the transmission link paths of a large number of probe request packets. Using tracking information to identify the transmission link path of a probe request packet resembles the medical practice of locating the cause of a disease by staining with isotopes or fluorescent agents, so the way this application locates gateway server faults based on tracking information may also be called staining.
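One simple way to analyze stained link paths is to find, for each probe request packet, the first expected tracking point that was never reported, and then count how often each point is the break. The sketch below illustrates that idea only; it is an assumed heuristic, not the detection rule of this application, and the point names and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Expected order of tracking points on a complete stained link.
EXPECTED_PATH: List[str] = ["TP1", "TP2", "TP3", "TP4", "TP5", "TP6"]


def first_missing_point(points: Iterable[str]) -> Optional[str]:
    """Return the earliest expected point that was not reported, if any."""
    seen = set(points)
    for point in EXPECTED_PATH:
        if point not in seen:
            return point
    return None


def summarize_breaks(paths: Dict[str, List[str]]) -> Counter:
    """Count, over many probe request packets, where the path first breaks."""
    breaks: Counter = Counter()
    for points in paths.values():
        missing = first_missing_point(points)
        if missing is not None:
            breaks[missing] += 1
    return breaks
```

Under this heuristic, many paths that break at the same point within one fault detection period would suggest looking at the device and operation associated with that point.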
It should be noted that the packet sending operations on a probe request packet specifically include sending the first request packet and sending the second request packet, and the packet receiving operations specifically include receiving the first request packet and receiving the second request packet. The first request packet is the request packet that the client sends to the business server through the gateway server. The second request packet is the packet that the business server generates in response to the first request packet it received and returns to the client through the gateway server. A first request packet and its corresponding second request packet belong to the same probe request packet and contain the same request packet identifier.
More specifically, the gateway server generates one trace-point record for each of four operations: receiving the first request packet from the client, sending the first request packet to the business server, receiving the second request packet from the business server, and sending the second request packet to the client. The client generates one trace-point record when it sends the first request packet and another when it receives the second request packet.
When the gateway server is not faulty, the fault detection server should receive at least six trace-point records for each probe request packet, and these six records represent the complete transmission link path of the packet. Conversely, if the fault detection server receives fewer than six records for a probe request packet, the packet was dropped somewhere along the path, which in turn suggests that the gateway server may be faulty.
For example, as shown in Figure 2, this application defines six nodes that record and upload trace-point information: client packet sent, TP_CLIENT_SEND (TP1); gateway server inbound packet received, TP_LD_FRONT_RCV (TP2); gateway server inbound packet sent, TP_LD_FRONT_SND (TP3); gateway server outbound packet received, TP_LD_BACK_RCV (TP4); gateway server outbound packet sent, TP_LD_BACK_SND (TP5); and client packet received, TP_CLIENT_RCV (TP6). The inbound direction is the direction in which the first request packet is transmitted, and the outbound direction is the direction in which the corresponding second request packet is transmitted.
Specifically, after the client sends the first request packet, it can record the first trace-point record of the probe request packet, trace_point=TP1, which can be understood as the first point of a stained link. When the gateway server receives the first request packet, it can record the second record, trace_point=TP2, the second point of the stained link. After the gateway server finishes processing the first request packet and sends it to the business server, it can record the third record, trace_point=TP3, the third point of the stained link. When the gateway server receives the second request packet that the business server returns for the first request packet, it can record the fourth record, trace_point=TP4, the fourth point of the stained link. After the gateway server finishes processing the second request packet and sends it to the client, it can record the fifth record, trace_point=TP5, the fifth point of the stained link. When the client receives the second request packet, it can record the sixth record, trace_point=TP6, the sixth point of the stained link.
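The completeness check implied by the six trace points can be sketched as follows. This is only an illustration: the record shape (a pair of request packet identifier and trace point name), the function name `missing_trace_points`, and the sample addresses are assumptions, not part of the application.

```python
from collections import defaultdict

# The six trace points defined in the text, in path order.
EXPECTED_POINTS = ("TP1", "TP2", "TP3", "TP4", "TP5", "TP6")

def missing_trace_points(records):
    """Group records by request packet identifier and list the missing trace points.

    `records` is an iterable of (packet_id, trace_point) pairs; this shape is
    a hypothetical one chosen for illustration.
    """
    seen = defaultdict(set)
    for packet_id, trace_point in records:
        seen[packet_id].add(trace_point)
    return {
        packet_id: [tp for tp in EXPECTED_POINTS if tp not in points]
        for packet_id, points in seen.items()
    }

# One complete probe and one probe that stops after TP3.
complete_id = ("10.0.0.1", 40001, "203.0.113.10", 80)
broken_id = ("10.0.0.1", 40002, "203.0.113.10", 80)
records = [(complete_id, tp) for tp in EXPECTED_POINTS]
records += [(broken_id, tp) for tp in ("TP1", "TP2", "TP3")]

result = missing_trace_points(records)
```

An empty list for a packet means its stained link is complete; a non-empty list shows where the path was cut.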
It should be understood that, in practical applications, the business server may also generate trace-point information from its own sending and/or receiving operations on the probe request packet and report it to the fault detection server, so that the fault detection server learns a more complete transmission link path. Specifically, the business server can generate and report one trace-point record after it receives the first request packet that the gateway server forwards from the client, and another after it generates the second request packet for that first request packet and sends it to the gateway server (which forwards it to the client). In this scenario, if the gateway server is not faulty, the fault detection server should receive eight trace-point records for a probe request packet (which comprises a first request packet and its corresponding second request packet), and these eight records represent the complete transmission link path of the packet.
Optionally, so that the fault detection server can analyze gateway server faults more accurately from the transmission link paths, the probe request packets that the client sends within one fault detection cycle should all be different. Otherwise the fault detection server could confuse the trace-point records it receives and be unable to tell which probe request packet each record belongs to.
Specifically, a four-tuple identifier can be used to uniquely identify a probe request packet; that is, the four-tuple serves as the request packet identifier in every trace-point record of the same probe request packet. The four-tuple consists of the source IP address cip, the source port cport, the destination IP address vip, and the destination port vport. Accordingly, within one fault detection cycle the client and the business server can be controlled to exchange probe request packets whose four-tuple identifiers all differ, so that the fault detection server receives trace-point records carrying different four-tuple identifiers during that cycle.
That is, the four-tuple (cip, cport, vip, vport) can serve as the unique identifier of a probe request packet and is passed to the gateway server and the business server over the Transmission Control Protocol (TCP). In an actual design, more than 60,000 cport values are available and the client can probe a given business rule (vip, vport) once every 5 s, so a probe request packet with the same four-tuple typically recurs only after several days. A fault detection cycle, by contrast, usually lasts minutes. This guarantees that within one fault detection cycle the fault detection server obtains trace-point information for probe request packets with distinct four-tuple identifiers, which achieves the purpose of staining.
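The "several days" figure can be checked with a short calculation. The sketch below assumes exactly 60,000 usable source ports and assumes the client cycles through all of them before reusing one; both are simplifications for illustration, and the constant names are hypothetical.

```python
# Figures taken from the text; the exact port count is an assumed round number.
CPORT_COUNT = 60000     # usable source ports ("more than 60,000")
PROBE_INTERVAL_S = 5    # one probe per (vip, vport) rule every 5 seconds

# Time until the same (cip, cport, vip, vport) four-tuple is used again,
# assuming ports are consumed one per probe and never reused early.
repeat_seconds = CPORT_COUNT * PROBE_INTERVAL_S
repeat_days = repeat_seconds / 86400

print(f"{repeat_seconds} s, about {repeat_days:.2f} days")
```

Under these assumptions a four-tuple repeats after roughly three and a half days, far longer than a fault detection cycle measured in minutes.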
Step 302: Perform gateway fault detection on the network device according to the trace-point information of the probe request packet.
After obtaining the trace-point information of the probe request packets, the fault detection server can use it to perform fault detection on the gateway server along several dimensions, such as whether a gateway server has failed, whether a gateway server has a jitter packet loss fault, and whether a gateway server has abnormal business rules.
In some embodiments, the fault detection server can use the trace-point information to detect whether a gateway server in the gateway cluster has failed. That is, when an alarm reports that the forwarding success rate of the gateway cluster has dropped, the fault detection server can locate the faulty gateway server in the cluster from the trace-point information it has obtained.
Specifically, the trace-point information that a gateway server generates from its receiving operations on probe request packets can be divided into front-end packet reception information and back-end packet reception information. Front-end packet reception information is the record generated when the gateway server finishes receiving the first request packet, and back-end packet reception information is the record generated when it finishes receiving the second request packet. The first request packet is the request packet that the client sends to the business server through the gateway server, the second request packet is the packet that the business server returns to the client through the gateway server after receiving the first request packet, and both contain the same request packet identifier. For each gateway server in the gateway cluster, the fault detection server can then count the front-end packets received and the back-end packets received from the information that gateway server uploaded, and determine the faulty gateway server in the cluster from the front-end and back-end counts of all gateway servers.
For example, when the trace-point information that the client and the gateway server upload to the fault detection server comprises TP1 to TP6, TP2 is front-end packet reception information and TP4 is back-end packet reception information. For each gateway server in the gateway cluster, the fault detection server counts the TP2 records uploaded by that gateway server (the front-end received count) and the TP4 records uploaded by that gateway server (the back-end received count), and then determines the faulty gateway server from the TP2 and TP4 counts of all gateway servers in the cluster.
In one possible implementation, the fault detection server determines the faulty gateway server according to whether its front-end received count and/or back-end received count has bottomed out (dropped to the bottom of its curve). That is, for each gateway server in the gateway cluster, the fault detection server checks whether the front-end received count and the back-end received count have bottomed out; if either or both have, that gateway server is determined to be a faulty gateway server.
Specifically, the fault detection server can plot a front-end received count curve for the gateway cluster from the front-end received counts of its gateway servers and use the curve to determine whether any gateway server's front-end received count has bottomed out; if so, that gateway server is directly determined to be a faulty gateway server. Similarly, the fault detection server can plot a back-end received count curve from the back-end received counts and use it to determine whether any gateway server's back-end received count has bottomed out; if so, that gateway server is directly determined to be a faulty gateway server.
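A minimal sketch of this bottomed-out check is given below. The text does not define numerically when a count has bottomed out, so the `floor` parameter (default 0) is an assumption, as are the record shape (gateway name, trace point name) and the function names.

```python
from collections import Counter

def receive_counts(records):
    """Count TP2 (front-end) and TP4 (back-end) records per gateway server."""
    front, back = Counter(), Counter()
    for gateway, trace_point in records:
        if trace_point == "TP2":
            front[gateway] += 1
        elif trace_point == "TP4":
            back[gateway] += 1
    return front, back

def bottomed_out(gateways, front, back, floor=0):
    """Return gateways whose front-end or back-end count is at or below `floor`.

    `floor` is a hypothetical stand-in for "bottomed out".
    """
    return sorted(g for g in gateways if front[g] <= floor or back[g] <= floor)

# gw3 keeps receiving first request packets but no second request packets.
records = (
    [("gw1", "TP2")] * 50 + [("gw1", "TP4")] * 50
    + [("gw2", "TP2")] * 48 + [("gw2", "TP4")] * 47
    + [("gw3", "TP2")] * 49
)
front, back = receive_counts(records)
faulty = bottomed_out(["gw1", "gw2", "gw3"], front, back)
```

The list of gateway names is passed in explicitly so that a gateway server that uploaded nothing at all is still examined.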
In another possible implementation, the fault detection server determines the faulty gateway server by checking whether a gateway server's front-end received count and/or back-end received count is significantly lower than those of the other gateway servers in the cluster. That is, the fault detection server determines a front-end reception average threshold and a back-end reception average threshold from the front-end and back-end received counts of all gateway servers in the cluster. Then, for each gateway server, it computes a first difference between the front-end reception average threshold and that gateway server's front-end received count and checks whether the first difference exceeds a preset front-end difference threshold; if so, that gateway server is determined to be a faulty gateway server. Likewise, for each gateway server, it computes a second difference between the back-end reception average threshold and that gateway server's back-end received count and checks whether the second difference exceeds a preset back-end difference threshold; if so, that gateway server is determined to be a faulty gateway server.
Specifically, the fault detection server can plot a front-end received count curve for the gateway cluster from the front-end received counts of its gateway servers; the curve reflects the cluster's front-end reception average threshold. From the curve, the fault detection server determines whether any gateway server's front-end received count is significantly lower than that of the others, that is, whether the difference between the front-end reception average threshold and its front-end received count exceeds the preset front-end difference threshold. If such a gateway server exists, it is directly determined to be a faulty gateway server.
Similarly, the fault detection server can plot a back-end received count curve for the gateway cluster from the back-end received counts of its gateway servers; the curve reflects the cluster's back-end reception average threshold. From the curve, the fault detection server determines whether any gateway server's back-end received count is significantly lower than that of the others, that is, whether the difference between the back-end reception average threshold and its back-end received count exceeds the preset back-end difference threshold. If such a gateway server exists, it is directly determined to be a faulty gateway server.
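The average-difference comparison can be sketched as follows. The text does not say how the average threshold is computed, so using the arithmetic mean over all gateway servers (the faulty one included) is an assumption; the name `below_average`, the parameter `max_gap`, and the sample counts are also hypothetical.

```python
from statistics import mean

def below_average(counts, max_gap):
    """Return gateways whose count lies more than `max_gap` below the cluster mean.

    `counts` maps gateway name to received count; `max_gap` plays the role of
    the preset difference threshold.
    """
    avg = mean(counts.values())
    return sorted(g for g, n in counts.items() if avg - n > max_gap)

# Hypothetical front-end received counts for one detection cycle.
front_counts = {"gw1": 100, "gw2": 98, "gw3": 102, "gw4": 40}
suspects = below_average(front_counts, max_gap=30)
```

The same function can be applied unchanged to the back-end received counts with the preset back-end difference threshold.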
In this way, when an alarm reports that the forwarding success rate of the gateway cluster has dropped, the faulty gateway server in the cluster, that is, the gateway server whose forwarding performance is impaired, can be located quickly and accurately from the trace-point information uploaded by the gateway servers (the front-end and back-end packet reception information). Compared with the prior-art approach in which operations staff inspect the traffic of each gateway server one by one to locate the faulty one, the approach provided in the embodiments of this application is faster and more efficient.
In some embodiments, the fault detection server can also detect whether a gateway server has a jitter packet loss fault. When one gateway server has such a fault, the received counts of the gateway servers in the cluster do not actually differ much, so the approach described above cannot detect it on its own. The embodiments of this application therefore also provide a method for detecting jitter packet loss faults.
The analysis below is based on the network architecture shown in Figure 4. In the client -> gateway server -> business server direction, the external network switch and the gateway servers in the gateway cluster form a neighbor network through the OSPF (Open Shortest Path First) routing protocol, and every device in this neighbor network obtains equal-cost routes. Similarly, in the business server -> gateway server -> client direction, the internal network switch and the gateway servers in the gateway cluster also form a neighbor network through OSPF.
In the client -> gateway server -> business server direction, when the first request packet reaches the external network switch, the switch uses a hash policy to distribute first request packets across the gateway servers with approximately equal probability. For example, the policy may be a two-tuple hash, that is, a hash over the source IP and destination IP of the first request packet. Similarly, in the business server -> gateway server -> client direction, when the second request packet reaches the internal network switch, the switch uses a hash policy to distribute second request packets across the gateway servers with approximately equal probability.
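The effect of a two-tuple hash can be illustrated with a small sketch. The hash function a real switch uses is vendor-specific and not given in the text; SHA-256 is used here only because it is deterministic across runs, and the function name `pick_gateway` and the gateway names are hypothetical.

```python
import hashlib

def pick_gateway(src_ip, dst_ip, gateways):
    """Map a (source IP, destination IP) pair to one gateway server.

    Ports are deliberately not part of the key, mirroring the two-tuple
    hash described in the text.
    """
    digest = hashlib.sha256(f"{src_ip}|{dst_ip}".encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]

gateways = ["gw1", "gw2", "gw3", "gw4"]
first = pick_gateway("10.0.0.1", "203.0.113.10", gateways)
second = pick_gateway("10.0.0.1", "203.0.113.10", gateways)
```

As long as the list of gateway servers is unchanged, the same IP pair always selects the same gateway server; if the list changes, the selection can change as well.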
It follows from this principle that, in most neighbor networks built on OSPF, request packets with the same source IP address and destination IP address always land on one fixed gateway server as long as the neighbor relationships are stable. On this basis, the fault detection server can maintain, in its data analysis program, a mapping from each IP address combination (source IP address and destination IP address) to the front-end gateway server (the gateway server that the first request packet passes through) and the back-end gateway server (the gateway server that the second request packet passes through).
That is, the fault detection server can obtain the trace-point information of historical probe request packets and determine from it the front-end gateway server and the back-end gateway server that each historical probe request packet passed through. A historical probe request packet comprises a historical first request packet, which the client sent to the business server through the front-end gateway server, and a historical second request packet, which the business server returned to the client through the back-end gateway server after receiving the historical first request packet. The fault detection server then builds a mapping between the IP address combination in the historical probe request packet (source IP address and destination IP address) and that front-end gateway server and back-end gateway server.
In other words, from the trace-point information that the client and the gateway servers upload for probe request packets, the fault detection server can automatically learn the mapping between the IP address combination (cip, vip) in a probe request packet and the front-end gateway server that the first request packet passes through, as well as the mapping between that combination and the back-end gateway server that the corresponding second request packet passes through.
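The learning step can be sketched as building a dictionary keyed by (cip, vip). The record fields `id`, `trace_point` and `gateway`, the function name `learn_mapping`, and the sample addresses are assumptions made for illustration; the text does not specify a record layout.

```python
def learn_mapping(records):
    """Learn (cip, vip) -> {"front": gateway, "back": gateway} from historical records.

    A TP2 record reveals the front-end gateway server and a TP4 record the
    back-end gateway server of the probe it belongs to.
    """
    mapping = {}
    for rec in records:
        cip, _cport, vip, _vport = rec["id"]
        entry = mapping.setdefault((cip, vip), {"front": None, "back": None})
        if rec["trace_point"] == "TP2":
            entry["front"] = rec["gateway"]
        elif rec["trace_point"] == "TP4":
            entry["back"] = rec["gateway"]
    return mapping

probe_id = ("10.0.0.1", 40001, "203.0.113.10", 80)
history = [
    {"id": probe_id, "trace_point": "TP1", "gateway": None},
    {"id": probe_id, "trace_point": "TP2", "gateway": "gw1"},
    {"id": probe_id, "trace_point": "TP4", "gateway": "gw3"},
]
mapping = learn_mapping(history)
```

The source port and destination port are ignored when forming the key, because the switches hash only on the IP pair.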
In addition, when the business server sends the second request packet to the gateway server, it encapsulates the packet with a GRE header placed in front of the packet's network layer (as shown in Figure 4). The internal network switch therefore applies two kinds of hash: an inner hash (for request packets without a GRE header) and an outer hash (for request packets with a GRE header). This introduces a new problem. With the inner hash, the internal network switch can determine a specific gateway server directly from the IP address combination in the request packet. With the outer hash, however, the switch must hash on the IP address combination (rs_ip, tsv) encapsulated in the GRE header, so request packets that contain the same IP address combination may pass through different back-end gateway servers.
To solve this problem, the historical first request packets sent by a target client can be controlled so that they are sent to a target business server, where the target client and the target business server have a corresponding relationship. That is, rules dedicated to probing can be created in the gateway cluster and configured for session persistence, so that when a gateway server forwards first request packets, those sent by the same client (and therefore carrying the same cip) are always forwarded to the same business server. Because vip and tsv have a fixed correspondence, requests with the same (cip, vip) still land on the same back-end gateway server.
After the mapping between the IP address combination (cip, vip) and the front-end and back-end gateway servers has been built, the fault detection server can use it to detect jitter faults on the gateway servers in the cluster. That is, when the fault detection server determines from the trace-point information of a probe request packet that the packet's transmission link is incomplete, it can determine the destination gateway server of that packet from the trace-point information missing from the link, the IP address combination in the packet, and the mapping, and update the packet loss count of that destination gateway server accordingly. When the packet loss count of the destination gateway server reaches a preset packet loss threshold, that gateway server is determined to have a jitter packet loss fault.
For example, again using the transmission link shown in Figure 2, suppose that trace_point=TP4 is missing from the trace-point information of a probe request packet. The fault detection server can then determine that the packet's transmission link is incomplete. Because the missing record is the one the gateway server should upload after receiving the second request packet, the loss is attributed to the back-end gateway server. In this case the fault detection server first determines the IP address combination (cip, vip) in the probe request packet, then finds the mapping that contains this combination, and takes the back-end gateway server in that mapping as the destination gateway server of the packet. In other words, the second request packet should have been delivered to the destination gateway server but was never received there, so a jitter packet loss occurred. The fault detection server then increments the packet loss count of the destination gateway server by 1. If the packet loss count of any gateway server in the cluster reaches the preset packet loss threshold, that gateway server is determined to have a jitter packet loss fault.
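The attribution and counting steps can be sketched as follows. The text gives only the TP4 example; attributing a probe by its first missing trace point, treating a missing TP2 as a front-end loss, and leaving all other cases unattributed are assumptions of this sketch, as are the names `record_loss` and `jitter_faulty` and the mapping layout.

```python
from collections import Counter

ORDER = ("TP1", "TP2", "TP3", "TP4", "TP5", "TP6")
# Which side of the learned mapping a missing receive record points to.
ROLE_OF_MISSING = {"TP2": "front", "TP4": "back"}

def record_loss(loss_counts, mapping, packet_id, seen_points):
    """Attribute one incomplete probe to a gateway; return that gateway or None."""
    missing = [tp for tp in ORDER if tp not in seen_points]
    if not missing:
        return None  # the stained link is complete
    role = ROLE_OF_MISSING.get(missing[0])
    cip, _cport, vip, _vport = packet_id
    entry = mapping.get((cip, vip))
    if role is None or entry is None:
        return None  # case not covered by this sketch, or mapping not learned
    gateway = entry[role]
    loss_counts[gateway] += 1
    return gateway

def jitter_faulty(loss_counts, threshold):
    """Return gateways whose loss count has reached the preset threshold."""
    return sorted(g for g, n in loss_counts.items() if n >= threshold)

mapping = {("10.0.0.1", "203.0.113.10"): {"front": "gw1", "back": "gw2"}}
losses = Counter()
for cport in (40001, 40002, 40003):
    probe = ("10.0.0.1", cport, "203.0.113.10", 80)
    record_loss(losses, mapping, probe, {"TP1", "TP2", "TP3"})  # TP4 missing
flagged = jitter_faulty(losses, threshold=3)
```

Because the mapping names the gateway server that should have received the packet, the loss can be charged to it even though that gateway server uploaded no record for the probe.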
It should be noted that the live network environment changes in real time. If a gateway server in the cluster suddenly goes offline, or a new gateway server suddenly comes online, the mapping that the fault detection server previously learned between IP address combinations and front-end and back-end gateway servers changes accordingly. The previously learned mapping then no longer applies to the current network environment, and continuing to use it may cause false detections of jitter packet loss. The embodiments of this application therefore also provide a scheme for checking whether a mapping is still valid.
即,故障检测服务器根据其接收的探测请求包对应的打点信息,确定该探测请求包经过的前端网关服务器和后端网关服务器,该探测请求包包括第一请求包和第二请求包,该第一请求包是客户端通过前端网关服务器向业务服务器发送的请求包,该第二请求包是业务服务器接收到第一请求包后,通过后端网关服务器向客户端反馈的请求包。然后,根据该探测请求包中包括的目标IP地址组合,确定包括该目标IP地址组合的目标映射关系;进而,判断该探测请求包经过的前端网关服务器与该目标映射关系中的前端网关服务器是否一致,以及判断该探测请求包经过的后端网关服务器与目标映射关系中的后端网关服务器是否一致,若存在任一项或多项不一致,则可以确定目标映射关系失效,需要更新该目标映射关系。In other words, the fault detection server determines the front-end gateway server and back-end gateway server that the probe request packet passes through based on the tracking information corresponding to the received probe request packet. The probe request packet includes a first request packet and a second request packet. The first request packet is a request packet sent by the client to the business server through the front-end gateway server, and the second request packet is a request packet sent by the business server to the client through the back-end gateway server after receiving the first request packet. Then, based on the target IP address combination included in the probe request packet, the target mapping relationship including the target IP address combination is determined. Furthermore, it is determined whether the front-end gateway server that the probe request packet passes through is consistent with the front-end gateway server in the target mapping relationship, and whether the back-end gateway server that the probe request packet passes through is consistent with the back-end gateway server in the target mapping relationship. If any one or more inconsistencies exist, the target mapping relationship can be determined to be invalid and needs to be updated.
具体的,故障检测服务器接收到探测请求包对应的打点信息后,可以确定该探测请求包中的第一请求包经过的前端网关服务器,以及与该探测请求包中的第二请求包经过的后端网关服务器;然后,根据该探测请求包中包括的目标IP地址组合(cip,vip),确定包括该(cip,vip)的目标映射关系;进而,判断该目标映射关系中包括的前端网关服务器与该第一请求包实际经过的前端网关服务器是否一致,以及该目标映射关系中包括的后端网关服务器与该第二请求包实际经过的后端网关服务器是否一致,若存在任意一项不一致,则说明该目标映射关系已经不适用于现在的网络环境,相应地,故障检测服务器不再继续基于该目标映射关系进行抖动性丢包故障的检测,而是基于此后接收的探测请求包对应的打点信息重新学习映射关系,直至所学习的映射关系达到稳定,再基于新学习的映射关系进行抖动性丢包故障的检测。Specifically, after receiving the tracking information corresponding to the probe request packet, the fault detection server can determine the front-end gateway server that the first request packet in the probe request packet passed through, and the back-end gateway server that the second request packet in the probe request packet passed through. Then, based on the target IP address combination (cip, vip) included in the probe request packet, it determines the target mapping relationship including the (cip, vip). Furthermore, it determines whether the front-end gateway server included in the target mapping relationship is consistent with the front-end gateway server that the first request packet actually passed through, and whether the back-end gateway server included in the target mapping relationship is consistent with the back-end gateway server that the second request packet actually passed through. If any of these are inconsistent, it means that the target mapping relationship is no longer applicable to the current network environment. Accordingly, the fault detection server will no longer continue to detect jitter packet loss faults based on the target mapping relationship, but will relearn the mapping relationship based on the tracking information corresponding to the probe request packets received thereafter, until the learned mapping relationship reaches stability, and then detect jitter packet loss faults based on the newly learned mapping relationship.
应理解,在实际应用中,为了实现容错处理,故障检测服务器也可以针对映射关系确定失效阈值n(n为大于1的整数),即在确定出n次映射关系中的前端网关服务器或者后端网关服务器,与第一请求包实际经过的前端网关服务器或者第二请求包实际经过的后端网关服务器不一致的情况下,再确定该映射关系失效。It should be understood that, in practical applications, the fault detection server may also set a failure threshold n (n being an integer greater than 1) for a mapping relationship in order to tolerate occasional errors. That is, the mapping relationship is determined to be invalid only after n observations in which the front-end gateway server or back-end gateway server in the mapping relationship differs from the front-end gateway server that the first request packet actually passed through or the back-end gateway server that the second request packet actually passed through.
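The validity check with a failure threshold n might look like the sketch below. The class name `MappingEntry` and its fields are hypothetical, and counting mismatches cumulatively (rather than requiring them to be consecutive) is an assumption, since the text does not say which is intended.

```python
class MappingEntry:
    """Learned route for one (cip, vip) pair, with a mismatch counter."""

    def __init__(self, front, back, fail_threshold=3):
        self.front = front                    # learned front-end gateway
        self.back = back                      # learned back-end gateway
        self.fail_threshold = fail_threshold  # n in the text, assumed > 1
        self.mismatches = 0
        self.valid = True

    def observe(self, actual_front, actual_back):
        """Compare one probe's actual path with the learned one.

        Returns True while the entry is still considered valid. Once the
        entry is invalid, the caller is expected to relearn the mapping.
        """
        if actual_front != self.front or actual_back != self.back:
            self.mismatches += 1
            if self.mismatches >= self.fail_threshold:
                self.valid = False
        return self.valid
```

With `fail_threshold=2`, a single stray probe leaves the entry valid and the second mismatch invalidates it.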
在一些实施例中,故障检测服务器还可以检测网关服务器中是否存在异常的业务规则。具体的,网关服务器上报的打点信息可以包括:网关服务器根据对于探测请求包进行丢包操作时生成的丢包记录信息,该丢包记录信息中包括探测请求包的丢包原因;故障检测服务器获取到此类丢包记录信息后,可以根据这些丢包记录信息,分析网关服务器是否存在异常的业务规则。In some embodiments, the fault detection server can also detect whether there are abnormal business rules in the gateway server. Specifically, the tracking information reported by the gateway server may include: packet loss record information generated by the gateway server when performing packet loss operation on probe request packets, which includes the reason for the packet loss; after obtaining such packet loss record information, the fault detection server can analyze whether there are abnormal business rules in the gateway server based on this packet loss record information.
下面先结合图5,对网关服务器中主转发程序umod的转发流程进行介绍。如图5所示,第一请求包到达网关服务器中的网卡后,会通过RSS(Receive Side Scaling)均匀地分发到网卡的多个RX队列中,各个RX队列分别对应umod程序中的一个RX线程,RX线程接收到第一请求包后,根据该第一请求包中包括的(cip,vip),将该第一请求包哈希到负责业务处理的TX线程,TX线程对该第一请求包进行合法性检查、转发规则匹配、过载限速安全组检查、封包等处理,然后将处理后的第一请求包发送到与该TX线程绑定的TX队列上,进而,通过该TX队列将第一请求包发送给业务服务器,如此完成对于第一请求包的转发。应理解,网关服务器对于第二请求包的处理方式实际上与上述对于第一请求包的处理方式类似,只存在传输方向不同的区别而已。The forwarding process of the main forwarding program umod in the gateway server will be described below with reference to Figure 5. As shown in Figure 5, after the first request packet arrives at the network card in the gateway server, it is evenly distributed to multiple RX queues of the network card via RSS (Receive Side Scaling). Each RX queue corresponds to an RX thread in the umod program. After receiving the first request packet, the RX thread hashes the first request packet to the TX thread responsible for business processing based on the (cip, vip) included in the first request packet. The TX thread performs legality checks, forwarding rule matching, overload rate limiting security group checks, packet encapsulation, and other processing on the first request packet. Then, it sends the processed first request packet to the TX queue bound to the TX thread, and then sends the first request packet to the business server through the TX queue, thus completing the forwarding of the first request packet. It should be understood that the gateway server's processing method for the second request packet is actually similar to the processing method for the first request packet, with only the transmission direction being different.
基于本申请实施例提供的方法,网关服务器可以对探测请求包(包括第一请求包和第二请求包)的处理操作进行记录,类似于医学上利用同位素和荧光染色来定位病原,基于请求包对应的清晰的传输链路信息,清楚直接地定位故障位置。Based on the method provided in the embodiments of this application, the gateway server can record the processing operations of the detection request packets (including the first request packet and the second request packet), similar to the use of isotopes and fluorescent staining in medicine to locate pathogens. Based on the clear transmission link information corresponding to the request packet, the fault location can be clearly and directly located.
具体的,在本申请实施例提供的方法中,可以改造网关服务器中的umod程序,使其对于请求包的各个处理环节进行打点记录,如果一个探测请求包被丢掉,则通过drop_point记录丢包记录信息,并启动数据收集线程,将所记录的丢包记录信息传输至故障检测服务器。进而,故障检测服务器可以定期对其收集的丢包记录信息进行分析,以实现对于异常业务规则的全方位检测定位。Specifically, in the method provided in this application embodiment, the umod program in the gateway server can be modified to record each processing stage of the request packet. If a probe request packet is lost, the packet loss record information is recorded via drop_point, and a data collection thread is started to transmit the recorded packet loss record information to the fault detection server. Furthermore, the fault detection server can periodically analyze the collected packet loss record information to achieve comprehensive detection and localization of abnormal business rules.
示例性的,可以在网关服务器的转发逻辑中定义如下丢包记录信息:For example, the following packet loss record information can be defined in the forwarding logic of the gateway server:
// RX
DP_PORT_RX_MBUF_EMPTY = 100, // RX mbuf is empty
DP_PORT_RX_ENQUEUE_FAIL = 105, // RX enqueue failed
// TX before sched
DP_VIP_NOT_EXIST = 200, // VIP does not exist
DP_RULE_NOT_EXIST = 205, // Rule does not exist
DP_INVALID_VLANID = 210, // VLAN ID is invalid
DP_MARTIAN_SOURCE = 215, // Illegal packet
DP_NOT_SYN_SCHED = 220, // TCP, no connection-table entry, received a non-SYN packet
DP_INVALID_RS_TYPE = 225, // RS type error
DP_DEST_UNAVAILABLE = 230, // RS is unavailable
DP_DEST_OFFLINE = 235, // RS is offline
DP_SESSION_CREATE_FAIL = 240, // Session creation failed
// TX limit and flow control
DP_FLOW_OVERLOAD = 300, // Traffic overload
DP_CONN_OVERLOAD = 305, // Connection limit exceeded
DP_CONN_CREATE_FAIL = 310, // Connection creation failed
DP_IN_BLACKLIST = 315, // Request is in the blacklist
DP_NOT_IN_WHITHLIST = 320, // Request is not in the whitelist
DP_FLOW_LIMIT = 325, // Traffic rate limiting
DP_CTRL_DROP = 330, // Packet dropped by auxiliary process; deprecated, will be removed in a later version
DP_CTRL_FREE_CONN = 335, // Connection dropped by auxiliary process; deprecated, will be removed in a later version
// TX packet encap related
DP_MANGLE_INNER_FAIL = 400, // Packet modification failed
DP_TUNNEL_ENCAP_FAIL = 405, // Tunnel header encapsulation failed
DP_ROUTE_LOOKUP_FAIL = 410, // Route lookup failed
// TX xmit related
DP_XMIT_FAIL = 500, // Packet transmission failed
应理解,在实际应用中,除了可以定义上述丢包记录信息外,还可以根据实际需求定义其它丢包记录信息,本申请在此不对所定义的丢包记录信息做任何限定。It should be understood that in practical applications, in addition to defining the above-mentioned packet loss record information, other packet loss record information can be defined according to actual needs. This application does not impose any limitations on the defined packet loss record information.
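On the fault detection server side, the periodic analysis of such drop records could be sketched as below. The numeric codes and stage labels are copied from the listing above; the record fields `gateway` and `drop_point` and the function names are assumptions made for illustration.

```python
from collections import Counter

# Subset of the drop reasons listed above; values are copied from the listing.
DROP_REASONS = {
    200: "DP_VIP_NOT_EXIST",
    205: "DP_RULE_NOT_EXIST",
    300: "DP_FLOW_OVERLOAD",
    315: "DP_IN_BLACKLIST",
    500: "DP_XMIT_FAIL",
}

# The listing groups codes by hundreds; labels follow its section comments.
STAGES = {
    1: "RX",
    2: "TX before sched",
    3: "TX limit and flow control",
    4: "TX packet encap",
    5: "TX xmit",
}


def stage_of(code):
    """Processing stage in which a drop code was recorded."""
    return STAGES.get(code // 100, "unknown")


def summarize_drops(records):
    """Count drops per (gateway, reason name) over one analysis period."""
    counts = Counter()
    for rec in records:
        code = rec["drop_point"]
        name = DROP_REASONS.get(code, f"UNKNOWN_{code}")
        counts[(rec["gateway"], name)] += 1
    return counts
```

A gateway that reports many `DP_RULE_NOT_EXIST` drops for probes that should match a dial-test rule would be a candidate for an abnormal business rule, which is the kind of signal the analysis is meant to surface.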
相比相关技术中通过在网关服务器上部署抓包程序,根据网关服务器的抓包结果定位异常的业务规则的实现方式,本申请实施例提供的定位异常业务规则的方式,能够根据网关服务器上传的丢包打点信息全方位地检测业务规则,有效地提高异常业务规则的定位效率和定位准确性。Compared with related technologies that deploy packet capture programs on gateway servers and locate abnormal business rules from the capture results, the method provided in the embodiments of this application can comprehensively check business rules based on the packet-loss tracking information uploaded by the gateway servers, effectively improving the efficiency and accuracy of locating abnormal business rules.
可选的,考虑到客户端和业务服务器通常基于TCP协议进行通信,客户端和业务服务器经过三次握手后,业务服务器会感知到通信的建立,并为客户端相应地分配处理资源,而实际上在对网关服务器进行故障检测的过程中,业务服务器为客户端分配的处理资源并不会被客户端真正的利用,因此,为了避免业务服务器为客户端分配无用的处理资源,对业务服务器的处理资源造成不必要的浪费,本申请实施例还提供了一种在对网关服务器进行故障检测的过程中使得业务服务器无感知的方法。Optionally, considering that the client and the business server usually communicate based on the TCP protocol, after the client and the business server complete a three-way handshake, the business server will be aware of the establishment of communication and allocate processing resources to the client accordingly. However, in the process of fault detection of the gateway server, the processing resources allocated by the business server to the client will not be actually used by the client. Therefore, in order to avoid the business server allocating useless processing resources to the client and causing unnecessary waste of the business server's processing resources, this application embodiment also provides a method to make the business server unaware of the fault detection process of the gateway server.
即控制防火墙拦截客户端向业务服务器发送的第三次握手反馈信息,并控制内核向业务服务器发送响应失败消息。This means controlling the firewall to block the third handshake feedback information sent by the client to the business server, and controlling the kernel to send a response failure message to the business server.
具体的,如图6所示,客户端可以从应用层程序发起socket连接请求,利用Linux内核协议栈TCP三次握手的过程进行探测,当与业务服务器完成前两次握手后,利用防火墙将第三次握手ack拦截掉,从而阻止TCP建联成功,由此业务服务器将不会感知到此次探测请求包,也不会为客户端分配处理资源。最后,为了防止上述半连接状态占用服务器资源,客户端可以在此次探测结束后通知内核向业务服务器发送rst,以结束整个流程。如此,整个探测过程中和网关服务器只进行了一次完整的数据交互,并且做到了业务服务器无感知,十分轻量。Specifically, as shown in Figure 6, the client can initiate a socket connection request from an application-layer program and use the TCP three-way handshake of the Linux kernel protocol stack for probing. After the first two handshakes with the business server are completed, the firewall intercepts the ACK of the third handshake, which prevents the TCP connection from being established. As a result, the business server is not aware of this probe request packet and does not allocate processing resources to the client. Finally, to prevent the resulting half-open connection from occupying server resources, the client can notify the kernel to send an RST to the business server after the probe ends, which terminates the whole procedure. In this way, the probe involves only one complete data exchange with the gateway server and is invisible to the business server, which makes it very lightweight.
上述本申请实施例提供的故障检测方法,借鉴医学上利用同位素或荧光剂染色定位病因的方式,对互联网中经过网关服务器的探测请求包进行全链路染色打点,即利用客户端和/或网关服务器根据与探测请求包相关的发包操作和收包操作生成打点信息,得到探测请求包对应的打点信息,进而,基于在故障检测周期内获取到的多个探测请求包各自对应的打点信息,对网关服务器进行故障检测,以准确检测网关服务器是否存在故障,并在检测到存在故障的情况下定位故障位置和故障原因。如此,基于探测请求包的传输链路路径,实现复杂网络环境中网关服务器的故障检测及定位。The fault detection method provided in the above-described embodiments of this application draws on the medical approach of using isotope or fluorescent staining to locate the cause of disease. It performs end-to-end staining and marking on probe request packets passing through a gateway server on the Internet. Specifically, the client and/or gateway server generate marking information based on packet sending and receiving operations related to the probe request packets, obtaining the marking information corresponding to each probe request packet. Then, based on the marking information corresponding to multiple probe request packets obtained within the fault detection period, fault detection is performed on the gateway server to accurately detect whether a fault exists, and if a fault is detected, to locate the fault location and cause. Thus, based on the transmission link path of the probe request packets, fault detection and location of gateway servers in complex network environments are achieved.
下面介绍一种示例性的上述故障检测方法所应用的故障检测系统,该故障检测系统也可以被称为染色系统。如图7所示,该染色系统包括探测客户端(可以由服务器作为探测客户端)、日志收集存储系统、日志转发服务器(gw_probe_proxy)、日志分析服务器(gw_probe_brain)和运营系统。其中,日志收集存储系统用于收集存储网关服务器记录的打点信息;日志转发服务器(gw_probe_proxy)用于将探测客户端记录的打点信息以及日志收集存储系统存储的打点信息,转发给日志分析服务器(gw_probe_brain),该日志分析服务器(gw_probe_brain)实质上即为上文中的故障检测服务器,用于执行上述故障检测方法;运营系统面向运维工作人员。该染色系统可以提供实时监管网关服务器、追溯网关服务器故障等功能,服务于网关系统的现网环境。The following describes an exemplary fault detection system used in the aforementioned fault detection method, also known as a coloring system. As shown in Figure 7, this coloring system includes a probe client (which can be a server acting as the probe client), a log collection and storage system, a log forwarding server (gw_probe_proxy), a log analysis server (gw_probe_brain), and an operations system. The log collection and storage system collects and stores the logging information recorded by the gateway server. The log forwarding server (gw_probe_proxy) forwards the logging information recorded by the probe client and stored in the log collection and storage system to the log analysis server (gw_probe_brain), which is essentially the fault detection server mentioned above, used to execute the aforementioned fault detection method. The operations system is for maintenance personnel. This coloring system can provide real-time monitoring of the gateway server and trace gateway server faults, serving the current network environment of the gateway system.
对于探测客户端和探测点的选择,由于外网交换机和内网交换机与网关集群中的多台网关服务器均是负载均衡的关系,因此,无论是入包方向还是回包方向,均要保证对于网关集群的拨测能够覆盖该网关集群的所有网关服务器,通常情况下,网关集群中可以包括4到8台网关服务器。Regarding the selection of probe clients and probe points, since the external network switches and internal network switches are all load-balanced with the multiple gateway servers in the gateway cluster, it is necessary to ensure that the probes to the gateway cluster can cover all the gateway servers in the gateway cluster, whether in the direction of incoming packets or outgoing packets. Typically, the gateway cluster can include 4 to 8 gateway servers.
基于此,对于入包方向,可以采用30台探测服务器分3地部署的方式,在每个地区各部署10台探测服务器作为一组,全部采用cap(Consistency,Availability,Partitiontolerance)网络,以排除运营商网络质量、单点拨测机故障对拨测结果产生影响。对于回包方向,每个网关集群均设置专门用于拨测的规则,并在这些规则后面挂接20到30台业务服务器,分布于多地机房,以保证业务服务器的回包可以覆盖到网关集群中的所有网关服务器。Based on this, for inbound packets, 30 probe servers can be deployed across 3 locations, with 10 probe servers deployed in each location as a group. All servers should use a CAP (Consistency, Availability, Partition tolerance) network to eliminate the impact of carrier network quality and single-point probe server failures on the probe test results. For outbound packets, each gateway cluster should have dedicated rules for probe testing, and 20 to 30 service servers should be connected behind these rules, distributed across multiple data centers, to ensure that outbound packets from the service servers can cover all gateway servers in the gateway cluster.
对于数据汇总分析,染色系统的一个主要功能是进行线上网关集群的健康性监管,因此对数据的实时性、准确性要求较高。染色打点记录首先都是通过日志的形式记录在本地的,然后利用云架构平台的日志采集功能将日志汇总到Kafka缓存,由于数据源有两份(分别是客户端的数据源和网关服务器的数据源),因此,进行分析时需要将两份数据源汇聚在一起,即需要在一台机器的内存空间中完成汇聚,考虑到数据量巨大,单台机器可能无法处理过来,因此需要采用分布式的方式。具体设计以下两种方案可供选择:For data aggregation and analysis, a key function of the coloring system is to monitor the health of the online gateway cluster, so the data must be timely and accurate. Coloring records are first written locally as logs, and the log collection function of the cloud platform then aggregates the logs into a Kafka cache. Because there are two data sources (the client and the gateway server), analysis requires bringing the two sources together, which means the records for one probe must be joined in the memory of a single machine. Given the massive data volume, a single machine may not be able to keep up, so a distributed approach is needed. Two design options are available:
方案一:利用Spark进行实时分析。但是这种方式存在以下问题:客户端和网关服务器记录的打点信息分属于多个Kafka源,将同一拨测的所有打点信息都转发到Spark集群的同一台机器上,实现较为复杂,并且Spark作为一个庞大的数据分析系统需要巨大的维护成本和专门的人力支持,一旦出现问题,对于问题的定位极为困难。Option 1: Use Spark for real-time analysis. This approach has the following problems: the logging information recorded by the client and the gateway server belongs to multiple Kafka sources, and forwarding all logging information of the same dial test to the same machine in the Spark cluster is fairly complex to implement. Furthermore, Spark is a large data analysis system that requires high maintenance costs and dedicated staff, and once a problem occurs it is extremely difficult to locate.
方案二:开发日志转发gw_probe_proxy和日志分析gw_probe_brain两个程序。如图8所示,日志转发程序gw_probe_proxy仅进行哈希转发,保证基于标识四元组将同一拨测整体链路的打点信息转发到同一日志分析程序gw_probe_brain,这样多个日志分析程序gw_probe_brain之间完全没有联系,可以实现平行扩展。日志分析程序gw_probe_brain中的插件1可以基于打点信息进行数据分析,并将数据分析结果提供给健康中心决策程序,由此在秒级内完成数据处理,保证线上监管的实时性、可用性。Option 2: Develop two programs: `gw_probe_proxy` for log forwarding and `gw_probe_brain` for log analysis. As shown in Figure 8, the log forwarding program `gw_probe_proxy` only performs hash forwarding, ensuring that the logging information for the same entire testing link is forwarded to the same log analysis program `gw_probe_brain` based on the identifier four-tuple. This ensures that multiple log analysis programs `gw_probe_brain` are completely independent, allowing for parallel scaling. Plugin 1 within the log analysis program `gw_probe_brain` can perform data analysis based on the logging information and provide the results to the health center's decision-making process. This completes data processing within seconds, ensuring the real-time nature and availability of online monitoring.
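The hash-forwarding idea in Option 2 can be illustrated with the sketch below. The contents of the identifier four-tuple, the instance names, and the function name `pick_brain` are assumptions, since the text only states that forwarding is keyed on that four-tuple so that every record of one dial test reaches the same gw_probe_brain instance.

```python
import hashlib


def pick_brain(probe_key, brains):
    """Map a probe's identifier four-tuple to one analysis instance.

    A stable digest is used instead of Python's built-in hash(), which is
    randomised per process for strings and could send records of the same
    probe to different instances when several proxies are running.
    """
    joined = "|".join(map(str, probe_key)).encode("utf-8")
    digest = hashlib.sha256(joined).digest()
    index = int.from_bytes(digest[:8], "big") % len(brains)
    return brains[index]
```

Plain modulo hashing remaps most keys when the number of instances changes; whether that matters depends on how long an instance needs to hold the records of one probe, which the text does not specify.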
此外,日志分析程序gw_probe_brain中的插件2可以在接收打点失败时,通知日志转发程序gw_probe_proxy中的反馈/重传模块,使其重新发送没有被成功接收的打点信息。日志分析程序gw_probe_brain中的插件2还可以把汇聚后完整的链路数据发给负责离线处理的鹊方平台。Furthermore, Plugin 2 in the log analysis program gw_probe_brain can notify the feedback/retransmission module in the log forwarding program gw_probe_proxy to resend the unreceived log information when receiving log points fails. Plugin 2 in the log analysis program gw_probe_brain can also send the aggregated complete link data to the Quefang platform responsible for offline processing.
染色系统除了可以进行线上监管外,还可以作为网络问题分析工具使用,运维工作人员在定位问题时可能需要查看历史的打点信息,由于历史数据量巨大(每天可达10T+)且维度很多,利用传统查数据库建立索引的方式很难满足时间需求。为了解决该问题,本申请接入鹊方平台,如图7和图8所示,利用ES(ElasticSearch)的倒排索引实现数据快速离线查询,TB级别的数据可以在1s内查出结果。In addition to online monitoring, the coloring system can also be used as a network problem analysis tool. When locating issues, operations and maintenance personnel may need to review historical data points. Due to the massive volume of historical data (up to 10TB+ per day) and its numerous dimensions, traditional database indexing methods are insufficient to meet time requirements. To address this issue, this application integrates with the Quefang platform. As shown in Figures 7 and 8, it utilizes Elasticsearch's inverted index to achieve fast offline data retrieval; terabytes of data can be retrieved within 1 second.
在实际应用中,上述图7所示的染色系统可以达到以下效果:In practical applications, the coloring system shown in Figure 7 above can achieve the following effects:
1、自动容错—秒级监管集群问题1. Automatic fault tolerance—second-level monitoring of cluster issues
探测请求包对应的打点信息每20s统计上报一次,故障检测服务器可以在1分钟内作出决策,在现网自动容错中起到了至关重要的作用,极大地减少了网关现网问题单数。以一个网关集群包括两台网关服务器(x.x.x.101和x.x.x.102)为例,在两种不同的情况下,故障检测效果分别如下:The tracking information corresponding to the probe request packets is statistically reported every 20 seconds. The fault detection server can make a decision within 1 minute, playing a crucial role in automatic fault tolerance on the live network and significantly reducing the number of gateway issues. Taking a gateway cluster consisting of two gateway servers (x.x.x.101 and x.x.x.102) as an example, the fault detection performance under two different scenarios is as follows:
当一台网关服务器完全不能转发时,如图9所示,网关集群整体转发成功率下降,基于该网关服务器上传的前端收包信息统计出的前端收包数、以及基于该网关服务器上传的后端收包信息统计出的后端收包数出现明显下降,如此可以判断该网关服务器出现故障。When a gateway server is completely unable to forward packets, as shown in Figure 9, the overall forwarding success rate of the gateway cluster decreases. The front-end packet count, calculated from the front-end packet reception information uploaded by that gateway server, and the back-end packet count, calculated from the back-end packet reception information it uploads, both drop markedly, from which it can be concluded that the gateway server has failed.
当一台网关服务器出现抖动性丢包故障时,如图10所示,网关集群整体转发成功率下降,但是从网关服务器的收包数量上来看两台网关服务器并无明显差异,但是故障检测服务器可以通过IP路由学习,成功统计出丢掉的包全部属于x.x.x.101这台服务器,因此可以锁定该台服务器存在抖动性丢包故障。When a gateway server experiences jitter packet loss, as shown in Figure 10, the overall forwarding success rate of the gateway cluster decreases, yet the packet counts of the two gateway servers show no obvious difference. Through IP route learning, however, the fault detection server can determine that all of the lost packets belong to the server x.x.x.101, so that server can be identified as having a jitter packet loss fault.
2、运营工具—离线统计辅助日常问题定位2. Operational Tools—Offline Statistics to Assist in Daily Problem Locating
除了可以基于探测请求包对应的打点信息进行网关服务器的实时监管外,还可以提供日常问题的定位,例如,运维人员可以基于染色系统提供的离线数据分析界面分析指定的探测请求包的全链路信息,从而进行业务反馈的异常问题定位。In addition to real-time monitoring of gateway servers based on the tracking information of probe request packets, the system can also help locate day-to-day issues. For example, operations and maintenance personnel can use the offline data analysis interface provided by the coloring system to examine the full-link information of a specified probe request packet and thereby locate abnormal issues reported by the business side.
针对上文描述的故障检测方法,本申请还提供了对应的故障检测装置,以使上述故障检测方法在实际中的应用以及实现。In response to the fault detection method described above, this application also provides a corresponding fault detection device to enable the practical application and implementation of the above fault detection method.
参见图11,图11为上文图3所示的故障检测方法对应的一种故障检测装置1100的结构示意图,该故障检测装置1100包括:Referring to Figure 11, which is a structural schematic diagram of a fault detection device 1100 corresponding to the fault detection method shown in Figure 3 above, the fault detection device 1100 includes:
信息获取模块1101,用于在故障检测周期内,接收探测请求包所经过的网络设备各自上报的打点信息;所述打点信息是所述网络设备对所述探测请求包进行操作时生成的记录信息;所述网络设备包括客户端、网关服务器和业务服务器中的至少一个;The information acquisition module 1101 is used to receive, during the fault detection period, the tracking information reported by each network device through which the probe request packet passes; the tracking information is the record information generated by the network device when it operates on the probe request packet; the network device includes at least one of a client, a gateway server, and a service server.
所述信息获取模块1101,还用于将包含同一请求包标识的打点信息,作为所述请求包标识所对应的探测请求包的打点信息;The information acquisition module 1101 is further configured to use the tracking information containing the same request packet identifier as the tracking information of the probe request packet corresponding to the request packet identifier;
故障检测模块1102,用于根据所述探测请求包的打点信息,对所述网络设备进行网关故障检测。The fault detection module 1102 is used to perform gateway fault detection on the network device based on the logging information of the probe request packet.
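What the information acquisition module does with reports that share a request packet identifier can be sketched as follows. The field names `packet_id` and `trace_point`, and the idea of passing the expected trace points as a set, are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_packet(reports):
    """Collect all reports carrying the same request packet identifier."""
    grouped = defaultdict(list)
    for report in reports:
        grouped[report["packet_id"]].append(report)
    return dict(grouped)


def is_complete(points, expected):
    """True when every expected trace point was reported for one probe."""
    return expected <= {p["trace_point"] for p in points}
```

The fault detection module would then work on one group at a time, for instance treating a group that fails `is_complete` as an incomplete transmission link.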
可选的,在图11所示的故障检测装置的基础上,所述网络设备包括所述网关服务器,所述网关服务器部署在网关集群中,所述网关集群包括多台所述网关服务器;所述网关服务器上报的打点信息包括:前端收包信息和后端收包信息,所述前端收包信息是所述网关服务器对第一请求包完成收包操作时生成的记录信息,所述后端收包信息是所述网关服务器对第二请求包完成收包操作时生成的记录信息,所述第一请求包是所述客户端通过所述网关服务器向所述业务服务器发送的请求包,所述第二请求包是所述业务服务器接收到所述第一请求包后,通过所述网关服务器向所述客户端反馈的请求包,所述第二请求包与所述第一请求包中包含相同的请求包标识。参见图12,图12为本申请实施例提供的另一种故障检测装置1200的结构示意图。如图12所示,所述故障检测模块1102包括:Optionally, based on the fault detection device shown in Figure 11, the network device includes the gateway server, which is deployed in a gateway cluster, and the gateway cluster includes multiple gateway servers. The tracking information reported by the gateway server includes: front-end packet reception information and back-end packet reception information. The front-end packet reception information is the record information generated by the gateway server when it completes the packet reception operation for the first request packet, and the back-end packet reception information is the record information generated by the gateway server when it completes the packet reception operation for the second request packet. The first request packet is a request packet sent by the client to the service server through the gateway server, and the second request packet is a request packet fed back by the service server to the client through the gateway server after receiving the first request packet. The second request packet contains the same request packet identifier as the first request packet. Referring to Figure 12, Figure 12 is a structural schematic diagram of another fault detection device 1200 provided in this application embodiment. As shown in Figure 12, the fault detection module 1102 includes:
收包数统计单元1201,用于针对所述网关集群中每台所述网关服务器,根据所述网关服务器上传的所述前端收包信息和所述后端收包信息,分别统计所述网关服务器的前端收包数和后端收包数;The packet reception count counting unit 1201 is used to count the front-end packet reception count and the back-end packet reception count of each gateway server in the gateway cluster, based on the front-end packet reception information and the back-end packet reception information uploaded by the gateway server.
故障确定单元1202,用于根据所述网关集群中各台所述网关服务器的前端收包数和后端收包数,确定所述网关集群中的故障网关服务器。The fault determination unit 1202 is used to determine the faulty gateway server in the gateway cluster based on the number of packets received at the front end and the number of packets received at the back end of each gateway server in the gateway cluster.
可选的,在图12所示的故障检测装置的基础上,所述故障确定单元1202具体用于:Optionally, based on the fault detection device shown in Figure 12, the fault determination unit 1202 is specifically used for:
针对所述网关集群中每台所述网关服务器,判断所述网关服务器的前端收包数和后端收包数是否掉底,若所述前端收包数和所述后端收包数中任一项或多项掉底,则确定所述网关服务器为所述故障网关服务器。For each gateway server in the gateway cluster, determine whether the number of packets received at the front end and the number of packets received at the back end of the gateway server have dropped to zero. If any one or more of the number of packets received at the front end and the number of packets received at the back end have dropped to zero, then the gateway server is determined to be the faulty gateway server.
可选的,在图12所示的故障检测装置的基础上,所述故障确定单元1202具体用于:Optionally, based on the fault detection device shown in Figure 12, the fault determination unit 1202 is specifically used for:
根据所述网关集群中各台所述网关服务器的前端收包数和后端收包数,分别确定前端收包平均阈值和后端收包平均阈值;Based on the number of packets received by the front end and the number of packets received by the back end of each gateway server in the gateway cluster, the average threshold for front end packet reception and the average threshold for back end packet reception are determined respectively.
针对所述网关集群中每台所述网关服务器,确定所述前端收包平均阈值与所述网关服务器的前端收包数之间的第一差值,判断所述第一差值是否超过预设前端差值阈值,若是,则确定所述网关服务器为所述故障网关服务器;For each gateway server in the gateway cluster, a first difference is determined between the average front-end packet reception threshold and the number of front-end packets received by the gateway server. It is then determined whether the first difference exceeds a preset front-end difference threshold. If so, the gateway server is determined to be the faulty gateway server.
针对所述网关集群中每台所述网关服务器,确定所述后端收包平均阈值与所述网关服务器的后端收包数之间的第二差值,判断所述第二差值是否超过预设后端差值阈值,若是,则确定所述网关服务器为所述故障网关服务器。For each gateway server in the gateway cluster, a second difference is determined between the average backend packet reception threshold and the number of backend packets received by the gateway server. It is then determined whether the second difference exceeds a preset backend difference threshold. If so, the gateway server is determined to be the faulty gateway server.
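The two criteria used by the fault determination unit, bottoming out and falling too far below the cluster average, might be expressed as in the sketch below. Treating "bottoming out" as a count at or below a configurable floor and taking the "average threshold" to be the arithmetic mean are both assumptions; the function names and parameters are hypothetical.

```python
def bottomed_out(front_count, back_count, floor=0):
    """First criterion: either direction's packet count has hit the floor."""
    return front_count <= floor or back_count <= floor


def below_average(counts, max_gap):
    """Second criterion, applied to one direction at a time.

    counts maps gateway name -> packet count for that direction. Returns the
    gateways whose count is more than max_gap below the cluster mean.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(g for g, n in counts.items() if mean - n > max_gap)
```

The second function would be called once with the front-end counts and the preset front-end difference threshold, and once with the back-end counts and the preset back-end difference threshold.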
可选的,在图11所示的故障检测装置的基础上,所述网络设备至少包括所述客户端和所述网关服务器;所述网关服务器部署在网关集群中,所述网关集群包括多台所述网关服务器。参见图13,图13为本申请实施例提供的另一种故障检测装置1300的结构示意图。如图13所示,所述装置还包括:Optionally, based on the fault detection device shown in Figure 11, the network device includes at least the client and the gateway server; the gateway server is deployed in a gateway cluster, and the gateway cluster includes multiple gateway servers. Referring to Figure 13, Figure 13 is a schematic diagram of another fault detection device 1300 provided in an embodiment of this application. As shown in Figure 13, the device further includes:
映射关系构建模块1301,用于获取历史探测请求包的打点信息;根据所述历史探测请求包的打点信息,确定所述历史探测请求包经过的前端网关服务器和后端网关服务器;所述历史探测请求包包括历史第一请求包和历史第二请求包,所述历史第一请求包是所述客户端通过所述前端网关服务器向所述业务服务器发送的请求包,所述历史第二请求包是所述业务服务器接收到所述历史第一请求包后,通过所述后端网关服务器向所述客户端反馈的请求包;构建所述历史探测请求包中包括的IP地址组合与所述前端网关服务器、所述后端网关服务器之间的映射关系;所述IP地址组合包括源IP地址和目的IP地址。The mapping relationship construction module 1301 is used to obtain the tracking information of historical probe request packets; based on the tracking information of the historical probe request packets, determine the front-end gateway server and back-end gateway server through which the historical probe request packets pass; the historical probe request packets include a historical first request packet and a historical second request packet, the historical first request packet being a request packet sent by the client to the business server through the front-end gateway server, and the historical second request packet being a request packet fed back by the business server to the client through the back-end gateway server after receiving the historical first request packet; and construct a mapping relationship between the IP address combination included in the historical probe request packets and the front-end gateway server and the back-end gateway server; the IP address combination includes a source IP address and a destination IP address.
Optionally, on the basis of the fault detection apparatus shown in Figure 13, the mapping relationship construction module 1301 is further configured to:
Control a target client to send the historical first request packet to a target business server, where the target client corresponds to the target business server.
Optionally, on the basis of the fault detection apparatus shown in Figure 13, the fault detection module 1102 is specifically configured to:
When it is determined, according to the marking information of the probe request packet, that the transmission link corresponding to the probe request packet is incomplete, determine the destination gateway server corresponding to the probe request packet according to the marking information missing from the transmission link, the IP address combination included in the probe request packet, and the mapping relationship, and update the packet loss count corresponding to the destination gateway server;
When the packet loss count corresponding to the destination gateway server reaches a preset packet loss count threshold, determine that the destination gateway server has a jitter packet loss fault.
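The two steps above can be sketched as a small counter-based detector. This is a hedged illustration under stated assumptions: the stage names in `EXPECTED_STAGES`, the rule that the first missing stage decides which gateway is blamed, and the class and method names are all hypothetical and are not taken from the source.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple

IpPair = Tuple[str, str]       # (source IP, destination IP)
GatewayPair = Tuple[str, str]  # (front-end gateway, back-end gateway)

# Hypothetical marking stages of one probe round trip, in transmission order.
EXPECTED_STAGES = (
    "client_send", "frontend_gateway", "business_recv",
    "business_send", "backend_gateway", "client_recv",
)
FRONTEND_STAGES = {"frontend_gateway", "business_recv"}
BACKEND_STAGES = {"backend_gateway", "client_recv"}


class JitterLossDetector:
    def __init__(self, mapping: Dict[IpPair, GatewayPair], loss_threshold: int) -> None:
        self.mapping = mapping
        self.loss_threshold = loss_threshold
        self.loss_counts: Counter = Counter()

    def observe(self, ip_pair: IpPair, seen_stages: Set[str]) -> Optional[str]:
        """Record one probe; return the gateway judged to have a jitter loss fault, if any."""
        missing = [stage for stage in EXPECTED_STAGES if stage not in seen_stages]
        if not missing or ip_pair not in self.mapping:
            return None  # link complete, or no mapping to attribute the loss
        frontend, backend = self.mapping[ip_pair]
        # Assumption: the first missing stage indicates where the packet was lost.
        if missing[0] in FRONTEND_STAGES:
            gateway = frontend
        elif missing[0] in BACKEND_STAGES:
            gateway = backend
        else:
            return None  # loss not attributable to a gateway in this sketch
        self.loss_counts[gateway] += 1
        if self.loss_counts[gateway] >= self.loss_threshold:
            return gateway
        return None
```

A caller would invoke `observe` once per probe request packet in the detection period; a real system would presumably also reset or age the counters, which the sketch leaves out.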
Optionally, on the basis of the fault detection apparatus shown in Figure 13, the mapping relationship construction module 1301 is further configured to:
Determine, according to the marking information of the probe request packet, the front-end gateway server and the back-end gateway server through which the probe request packet passes, where the probe request packet includes a first request packet and a second request packet, the first request packet is a request packet sent by the client to the business server through the front-end gateway server, and the second request packet is a request packet fed back by the business server to the client through the back-end gateway server after the business server receives the first request packet;
Determine, according to the target IP address combination included in the probe request packet, the target mapping relationship that includes the target IP address combination;
Determine whether the front-end gateway server through which the probe request packet passes is consistent with the front-end gateway server in the target mapping relationship, and whether the back-end gateway server through which the probe request packet passes is consistent with the back-end gateway server in the target mapping relationship; if either or both are inconsistent, determine that the target mapping relationship is invalid and needs to be updated.
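The consistency check above can be sketched as a lookup followed by two comparisons. The function name and tuple layout are hypothetical, and overwriting the entry with the observed gateways is only one plausible way to "update" the mapping; the source states that an update is needed but not how it is performed.

```python
from typing import Dict, Tuple

IpPair = Tuple[str, str]       # (source IP, destination IP)
GatewayPair = Tuple[str, str]  # (front-end gateway, back-end gateway)


def refresh_if_stale(
    mapping: Dict[IpPair, GatewayPair],
    ip_pair: IpPair,
    observed: GatewayPair,
) -> bool:
    """Return True if the target mapping was invalid and has been replaced."""
    target = mapping.get(ip_pair)
    if target is None:
        return False  # no target mapping contains this IP address combination
    frontend_ok = target[0] == observed[0]
    backend_ok = target[1] == observed[1]
    if frontend_ok and backend_ok:
        return False  # mapping still valid
    # Assumption: the freshly observed path replaces the stale entry.
    mapping[ip_pair] = observed
    return True
```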
Optionally, on the basis of the fault detection apparatus shown in Figure 11, the network devices include the gateway server, and the marking information reported by the gateway server includes packet loss record information generated when the gateway server drops the probe request packet, where the packet loss record information includes the reason the probe request packet was dropped; in this case, the fault detection module 1102 is specifically configured to:
Analyze, according to the packet loss record information reported by the gateway server, whether the gateway server has an abnormal business rule.
Optionally, on the basis of the fault detection apparatus shown in Figure 11, the request packet identifier is a four-tuple identifier that includes a source IP address, a source port, a destination IP address, and a destination port; in this case, the information acquisition module 1101 is specifically configured to:
Receive, within the fault detection period, multiple pieces of marking information that each contain a different four-tuple identifier.
Optionally, on the basis of the fault detection apparatus shown in Figure 11, the client and the business server communicate based on the Transmission Control Protocol (TCP). Referring to Figure 14, Figure 14 is a schematic structural diagram of another fault detection apparatus 1400 provided in an embodiment of this application. As shown in Figure 14, the apparatus further includes:
A communication control module 1401, configured to control a firewall to intercept the third-handshake feedback information sent by the client to the business server, and to control the kernel to send a response failure message to the business server.
The fault detection apparatus provided in the foregoing embodiments of this application borrows the medical practice of locating the cause of a disease by staining with isotopes or fluorescent agents, and applies full-link staining and marking to the probe request packets that pass through gateway servers on the Internet. Specifically, the gateway server and/or the client generates marking information from the packet sending and receiving operations related to a probe request packet, which yields the marking information corresponding to that probe request packet. Then, based on the marking information corresponding to each of the multiple probe request packets obtained within the fault detection period, fault detection is performed on the gateway server to accurately detect whether the gateway server is faulty and, when a fault is detected, to locate the fault position and the fault cause. In this way, fault detection and localization for gateway servers in a complex network environment are achieved based on the transmission link path of the probe request packets.
An embodiment of this application further provides a device for detecting gateway server faults. The device may specifically be a server, and the server provided in the embodiment of this application is described below from the perspective of its hardware implementation.
Referring to Figure 15, Figure 15 is a schematic structural diagram of a server 1500 provided in an embodiment of this application. The server 1500 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1522 (for example, one or more processors), a memory 1532, and one or more storage media 1530 (for example, one or more mass storage devices) that store application programs 1542 or data 1544. The memory 1532 and the storage medium 1530 may provide transient or persistent storage. A program stored in the storage medium 1530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the server. Furthermore, the central processing unit 1522 may be configured to communicate with the storage medium 1530 and to execute, on the server 1500, the series of instruction operations in the storage medium 1530.
The server 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input/output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The steps performed by the server in the foregoing embodiments may be based on the server structure shown in Figure 15.
The CPU 1522 is configured to perform the following steps:
Within a fault detection period, receive the marking information reported by each of the network devices through which a probe request packet passes, where the marking information is record information generated when a network device operates on the probe request packet, and the network devices include at least one of a client, a gateway server, and a business server;
Take the marking information that contains the same request packet identifier as the marking information of the probe request packet corresponding to that request packet identifier;
Perform gateway fault detection on the network devices according to the marking information of the probe request packet.
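The second step, collecting the marking information that shares a request packet identifier, amounts to a group-by operation. The sketch below assumes the four-tuple identifier mentioned earlier in this description; the `FourTuple` and `Mark` record types and their field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class Mark(NamedTuple):
    """Hypothetical marking record reported by one network device."""
    packet_id: FourTuple
    device: str
    action: str  # for example "send" or "receive"


def group_marks(marks: Iterable[Mark]) -> Dict[FourTuple, List[Mark]]:
    """Group the marks received in one detection period by request packet identifier."""
    grouped: Dict[FourTuple, List[Mark]] = defaultdict(list)
    for mark in marks:
        grouped[mark.packet_id].append(mark)
    return dict(grouped)
```

Each resulting list is the marking information of one probe request packet, and can then be checked for completeness along the expected transmission link.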
Optionally, the CPU 1522 may also be configured to perform the steps of any implementation of the fault detection method provided in the embodiments of this application.
An embodiment of this application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute any implementation of the fault detection method described in the foregoing embodiments.
An embodiment of this application further provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform any implementation of the fault detection method described in the foregoing embodiments.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be understood that, in this application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that only A exists, that only B exists, or that both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of these items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
The foregoing embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (24)
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40036252A (en) | 2021-05-28 |
| HK40036252B (en) | 2024-09-06 |