CN103326897A - Distributed computing environment general monitoring device and failure detection method - Google Patents

Distributed computing environment general monitoring device and failure detection method

Info

Publication number
CN103326897A
Authority
CN
China
Prior art keywords
connectivity
monitoring
server
module
heartbeat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201310229490
Other languages
Chinese (zh)
Other versions
CN103326897B (en)
Inventor
王苏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd
Priority to CN201310229490.8A
Publication of CN103326897A
Application granted
Publication of CN103326897B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

A general monitoring system for a distributed computing environment comprises a connectivity monitoring system and a service validity monitoring system. The connectivity monitoring system includes a connectivity monitoring module on the client, a connectivity response module on the server, and a connectivity monitoring channel connecting the two modules; it detects whether the network interconnection environment or the server is connected. The service validity monitoring system includes a heartbeat monitoring module on the client, a heartbeat response module on the server, and a heartbeat monitoring channel connecting the two modules; it detects whether the server has failed. The system avoids the inefficiency and delay of manual intervention and fault diagnosis, makes full use of the capacity of large centralized central-office equipment, improves its availability, and protects the operator's return on investment.

Description

A general monitoring device and failure detection method for a distributed computing environment

【Technical Field】

The invention relates to a general monitoring software device and a failure detection method for a distributed computing environment, applicable to unattended command-response processing and automatic batch command processing.

【Background Art】

Client-server software systems based on network communication, including Internet-based distributed client-service systems and chassis-based main control/line card equipment built around a backplane, share the network-distributed character of client-service interaction. In passive optical network (PON) broadband access, for example, chassis-based plug-in card equipment is used extensively. As data communication networks and access networks become flatter, central offices are merged and the central-office equipment must carry more user access. Service configuration and provisioning are carried out by the client issuing commands and batch commands to the server. During device commissioning and restart, batch commands must be executed unattended to automatically restore the previously saved configuration. These processes are closely tied to user experience and directly affect service quality. In addition, the network management system needs to obtain the running state of the equipment and collect performance data for fault management, which is also done by sending commands to the equipment. Chassis-based plug-in card equipment must carry more users while guaranteeing availability as high as 99.99999%, so its key components need active/standby redundancy, and the number of slots for service boards must grow to support more access users. Such equipment usually consists of a main control board, multiple line cards, and auxiliary boards. The main control board requires active/standby redundancy, and the relationship between the main control board and the service boards (line cards) is that of client and server. The chassis backplane usually provides a large number of slots, and the boards are interconnected through a network, forming a typical distributed-computing command execution environment.

In a client-server system in a distributed computing environment, the client sends commands to the server and obtains the server's responses. A fault in the interconnection environment or network, or a server failure, disrupts the command request and response flow. In particular, during batch execution of automatic configuration commands, one blocked command in the middle stalls the whole transaction. A system framework and processing method are therefore needed to provide fault tolerance and keep the processing flow running smoothly.

Fig. 1 shows the command request-response process in the prior art. Across command interactions, including batch command processing, the latency of command response processing varies widely. The response processing may also be prolonged indefinitely without a result because the service module has failed, or the result may be undeliverable because of an environment fault. In either case the client waits indefinitely for the command result, which seriously affects client command interaction, especially automatic batch command processing. If a timeout threshold is set manually for the wait, based on the longest response processing latency, the threshold is hard to determine because the service module's computing environment varies, and a service module failure can still seriously delay batch command execution. Viewed another way: because response latency varies so widely, the wait time must be preset to the longest expected time, yet a fault in the network communication channel or a failure of the service module also leaves the client module without a response. Waiting for the longest time then becomes blind waiting, because the fault is hard to identify. If the wait time is too short, the client module ignores valid responses from the service module; if it is too long, the judgment of the service module's validity and availability is delayed. Either way, the system's command processing efficiency drops.

【Summary of the Invention】

The purpose of the present invention is to provide a general monitoring software device and a failure detection method for a distributed computing environment that avoid blind waiting, avoid delayed judgment of validity and availability, and improve the system's command processing efficiency.

The present invention provides a general monitoring software device for a distributed computing environment, characterized in that it includes:

a connectivity monitoring system, comprising a connectivity monitoring module on the client, a connectivity response module on the server, and a connectivity monitoring channel connecting the connectivity monitoring module and the connectivity response module, the connectivity monitoring system being used to detect whether the network interconnection environment or the server is connected;

a service validity monitoring system, comprising a heartbeat monitoring module on the client, a heartbeat response module on the server, and a heartbeat monitoring channel connecting the heartbeat monitoring module and the heartbeat response module, the service validity monitoring system being used to detect whether the server has failed.

On the basis of the above technical solution, the connectivity monitoring module periodically issues a connectivity detection command; if the connectivity response module fails to respond for multiple cycles, the network interconnection environment or the server is judged to be disconnected.

On the basis of the above technical solution, the heartbeat monitoring module initiates a validity detection command; if the heartbeat response module does not respond within a certain period of time, the server is judged to have failed.

On the basis of the above technical solution, the general monitoring software device for the distributed computing environment further includes a chassis interconnection mechanism.

On the basis of the above technical solution, the chassis interconnection mechanism is a chassis backplane into which server boards are inserted, and the chassis backplane is provided with a detection element for detecting whether a service board is connected.

The present invention also provides a failure detection method using the general monitoring software device for the distributed computing environment. The failure detection method includes the following two tasks:

Task 1: The connectivity monitoring system periodically monitors whether the network interconnection environment or the server is connected, and the service validity monitoring system periodically monitors whether the server has failed;

Task 2:

A: The client issues a command, waits one time slice for the command response, and checks whether the server's response has been received. If it has, the client continues command execution; if not, go to step B;

B: Check whether the connectivity response module has responded within the corresponding cycle. If it has, go to step C; if it has not, the command times out;

C: Check whether the heartbeat response module has responded. If it has, return to step A; if it has not, the command times out.

On the basis of the above technical solution, Task 1 and Task 2 are executed concurrently.

On the basis of the above technical solution, when a command times out, the client pauses or switches to another redundant service.
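
The handling of a timed-out command can be illustrated with a short Python sketch. It is not the patent's implementation: the names `CommandTimeout` and `execute_with_failover`, and the choice to model each redundant service as a callable that either returns a result or raises on timeout, are assumptions made only for this example.

```python
from typing import Callable, Optional, Sequence


class CommandTimeout(Exception):
    """Hypothetical signal that the monitors declared a command timed out."""


def execute_with_failover(
    services: Sequence[Callable[[str], str]],
    command: str,
) -> Optional[str]:
    """Send `command` to each redundant service in order.

    Returns the first successful result. Returns None when every service
    timed out, which the caller can treat as "pause command processing".
    """
    for service in services:
        try:
            return service(command)
        except CommandTimeout:
            # This service is unreachable or failed: move to the next one.
            continue
    return None
```

Returning `None` rather than raising keeps the "pause" decision with the caller, which could for instance raise an alarm before stopping a batch; that split of responsibilities is a design choice of this sketch, not something the text prescribes.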

Compared with the prior art, the present invention uses the connectivity monitoring system to detect whether the network interconnection environment or the server is connected, and the service validity monitoring system to detect whether the server has failed. Unattended bulk configuration commands and device status read commands in networked and chassis-based equipment can thus execute more reliably, and fast fault localization with matching handling greatly improves command execution efficiency. The invention avoids the inefficiency and delay of manual intervention and fault diagnosis, makes full use of the capacity of large centralized central-office equipment, improves its availability, and protects the operator's return on investment.

【Description of the Drawings】

Fig. 1 is a schematic diagram of the command request-response process in the prior art;

Fig. 2 is a diagram of the system architecture and functional modules of the present invention;

Fig. 3 is a flowchart of the failure detection method of the present invention.

【Detailed Description】

Please refer to Fig. 2, a diagram of the system architecture and functional modules of the present invention. The client 103 and the server 104 are connected through a command channel 111. The general monitoring software device for the distributed computing environment includes a connectivity monitoring system and a service validity monitoring system. The connectivity monitoring system includes a connectivity monitoring module 105 on the client, a connectivity response module 106 on the server, and a connectivity monitoring channel 112 connecting the connectivity monitoring module 105 and the connectivity response module 106; it detects whether the network interconnection environment or the server is connected. The service validity monitoring system includes a heartbeat monitoring module 101 on the client, a heartbeat response module 102 on the server, and a heartbeat monitoring channel 110 connecting the heartbeat monitoring module 101 and the heartbeat response module 102; it detects whether the server has failed.

The connectivity monitoring module 105 periodically issues a connectivity detection command. If the connectivity response module 106 fails to respond for multiple cycles, the network interconnection environment or the server 104 is judged to be disconnected. A fault is then reported and the command is stopped, or the client 103 switches to another redundant service.
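
The "multiple cycles without a response" rule can be sketched as a counter of consecutive missed cycles. The class name `ConnectivityMonitor`, the method `record_cycle`, and the default threshold of 3 are assumptions for illustration; the text does not fix how many cycles count as "multiple".

```python
class ConnectivityMonitor:
    """Tracks consecutive probe cycles that went unanswered (illustrative)."""

    def __init__(self, max_missed_cycles: int = 3) -> None:
        if max_missed_cycles < 1:
            raise ValueError("max_missed_cycles must be at least 1")
        self.max_missed_cycles = max_missed_cycles  # assumed threshold
        self.missed = 0

    def record_cycle(self, replied: bool) -> None:
        """Call once per probe cycle with whether the responder answered."""
        # Any reply resets the count; a silent cycle increments it.
        self.missed = 0 if replied else self.missed + 1

    @property
    def connected(self) -> bool:
        """False once the responder has been silent for the threshold."""
        return self.missed < self.max_missed_cycles
```

Requiring several consecutive misses, instead of one, keeps a single dropped probe from being reported as a disconnection.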

The heartbeat monitoring module 101 initiates a validity detection command. If the heartbeat response module 102 does not respond within a certain period of time, the server 104 is judged to have failed, and the command is stopped or the client 103 switches to another redundant service.
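
A heartbeat check of this kind can be sketched as a "time since last reply" comparison. The class `HeartbeatMonitor`, its `timeout_s` parameter, and the injectable `clock` are hypothetical names chosen for the example; treating the moment of construction as the first reply is also an assumption of the sketch.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Declares the server failed when no heartbeat reply arrives in time."""

    def __init__(
        self,
        timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s       # assumed "certain period of time"
        self._clock = clock              # injectable so tests need no sleeping
        self._last_reply = clock()       # assumption: start counts as a reply

    def on_reply(self) -> None:
        """Record that the heartbeat responder just answered."""
        self._last_reply = self._clock()

    def server_alive(self) -> bool:
        """True while the last reply is no older than the timeout."""
        return (self._clock() - self._last_reply) <= self.timeout_s
```

Using a monotonic clock avoids false failures when the wall clock is adjusted, and passing the clock in makes the timeout logic testable with a fake time source.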

When a chassis interconnection mechanism 108 is present, a detection element is provided on it. The detection element detects whether the chassis interconnection mechanism 108 is correctly connected. In this embodiment, the chassis interconnection mechanism 108 is a chassis backplane that provides slots for the server 104 boards. The detection element detects whether the server 104 board is installed; if it is not installed, the server 104 is judged to have failed.

The following describes the failure detection method using the general monitoring software device for the distributed computing environment of the present invention. The method includes the following two concurrently executed tasks:

Task 1: The connectivity monitoring system periodically monitors whether the network interconnection environment 107 or the server 104 is connected, and the service validity monitoring system periodically monitors whether the server 104 has failed. The connectivity monitoring module 105 in the connectivity monitoring system periodically issues a connectivity detection command; if the connectivity response module 106 fails to respond for multiple cycles, the network interconnection environment or the server 104 is judged to be disconnected.

Task 2:

A: After issuing a command, the client 103 waits one time slice to receive the response and checks whether a response from the server 104 has been received. If it has, the client continues command execution; if not, go to step B;

B: Check whether the connectivity response module 106 has responded within the corresponding cycle. If it has, go to step C; if it has not, the command is judged to have timed out;

C: Check whether the heartbeat response module 102 has responded. If it has, return to step A and wait for the next time slice; if it has not, the command times out. When a command times out, an error alarm can be raised and command processing paused, or processing can switch, as preconfigured, to another redundant service and re-enter the send-receive command flow.
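
Steps A to C of Task 2 can be sketched as a loop that waits one time slice at a time and consults the two monitors between slices. This is a minimal illustration under stated assumptions: `wait_for_reply`, `poll_reply`, `connectivity_ok`, `heartbeat_ok`, and `CommandTimeout` are hypothetical names, and the monitors are assumed to be updated by Task 1 running concurrently elsewhere.

```python
from typing import Callable, Optional


class CommandTimeout(Exception):
    """Hypothetical error raised when a command is declared timed out."""


def wait_for_reply(
    poll_reply: Callable[[float], Optional[str]],
    connectivity_ok: Callable[[], bool],
    heartbeat_ok: Callable[[], bool],
    time_slice_s: float = 1.0,
) -> str:
    """Wait for a command reply in time slices, as in steps A to C.

    `poll_reply(t)` is assumed to block for at most `t` seconds and return
    the reply, or None if nothing arrived during that slice.
    """
    while True:
        # Step A: wait one time slice for the command reply.
        reply = poll_reply(time_slice_s)
        if reply is not None:
            return reply
        # Step B: no reply yet, so check the connectivity monitor.
        if not connectivity_ok():
            raise CommandTimeout("connectivity lost")
        # Step C: link is up, so check the heartbeat monitor.
        if not heartbeat_ok():
            raise CommandTimeout("server failed")
        # Both monitors are healthy: return to step A for another slice.
```

In this sketch a slow but healthy server is waited on for as long as both monitors stay healthy, while a broken link or failed server ends the wait after roughly one time slice, so no single worst-case timeout has to be guessed in advance.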

The general monitoring software device and failure detection method for a distributed computing environment of the present invention can detect faults and failures relating to command execution quickly and effectively. They avoid the resource waste of long timeout waits when connectivity or the service has failed, as well as blind maintenance work that cannot distinguish the cause of the fault. Fast fault localization with matching handling greatly improves command execution efficiency. The invention avoids the inefficiency and delay of manual intervention and fault diagnosis, makes full use of the capacity of large centralized central-office equipment, improves its availability, and protects the operator's return on investment.

Claims (8)

1. A general monitoring system for a distributed computing environment, characterized in that it comprises:
a connectivity monitoring system, which includes a connectivity monitoring module on the client, a connectivity response module on the server, and a connectivity monitoring channel connecting the connectivity monitoring module and the connectivity response module, the connectivity monitoring system being used to detect whether the network interconnection environment or the server is connected;
a service validity monitoring system, which includes a heartbeat monitoring module on the client, a heartbeat response module on the server, and a heartbeat monitoring channel connecting the heartbeat monitoring module and the heartbeat response module, the service validity monitoring system being used to detect whether the server has failed.

2. The general monitoring software device for a distributed computing environment according to claim 1, characterized in that: the connectivity monitoring module periodically issues a connectivity detection command, and if the connectivity response module fails to respond for multiple cycles, the network interconnection environment or the server is judged to be disconnected.

3. The general monitoring software device for a distributed computing environment according to claim 1, characterized in that: the heartbeat monitoring module initiates a validity detection command, and if the heartbeat response module does not respond within a preset time, the server is judged to have failed.

4. The general monitoring software device for a distributed computing environment according to claim 1, characterized in that: the general monitoring software device for the distributed computing environment further includes a chassis interconnection mechanism.

5. The general monitoring software device for a distributed computing environment according to claim 4, characterized in that: the chassis interconnection mechanism is a chassis backplane into which server boards are inserted, and the chassis backplane is provided with a detection element for detecting whether a service board is connected.

6. A failure detection method using the general monitoring software device for a distributed computing environment according to any one of claims 1-4, characterized in that the failure detection method includes the following two tasks:
Task 1: The connectivity monitoring system periodically monitors whether the network interconnection environment or the server is connected, and the service validity monitoring system periodically monitors whether the server has failed;
Task 2:
A. The client issues a command, waits one time slice for the command response, and checks whether the server's response has been received. If it has, the client continues command execution; if not, go to step B;
B. Check whether the connectivity response module has responded within the corresponding cycle. If it has, go to step C; if it has not, the command times out;
C. Check whether the heartbeat response module has responded. If it has, return to step A; if it has not, the command times out.

7. The failure detection method according to claim 6, characterized in that: Task 1 and Task 2 are executed concurrently.

8. The failure detection method according to claim 6, characterized in that: when a command times out, the client switches to another redundant service or pauses.

CN201310229490.8A 2013-06-08 2013-06-08 Distributed computing environment general monitoring device and failure detection method Expired - Fee Related CN103326897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310229490.8A CN103326897B (en) 2013-06-08 2013-06-08 A kind of distributed computing environment versatile monitoring device and abatement detecting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310229490.8A CN103326897B (en) 2013-06-08 2013-06-08 A kind of distributed computing environment versatile monitoring device and abatement detecting method

Publications (2)

Publication Number Publication Date
CN103326897A 2013-09-25
CN103326897B 2016-12-28

Family

ID=49195440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310229490.8A Expired - Fee Related CN103326897B (en) 2013-06-08 2013-06-08 Distributed computing environment general monitoring device and failure detection method

Country Status (1)

Country Link
CN (1) CN103326897B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1134135C (en) * 2000-11-22 2004-01-07 深圳市中兴通讯股份有限公司 Communication method applicable to double-network fault-tolerance system
CN100589560C (en) * 2007-06-19 2010-02-10 中兴通讯股份有限公司 Method and system for switching streaming media server
CN102664763A (en) * 2012-03-20 2012-09-12 浪潮电子信息产业股份有限公司 Method for rapidly detecting connection states and making virtual machine HA

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105306305A (en) * 2015-11-12 2016-02-03 中国电子科技集团公司第三十研究所 Traffic data acquisition method and device for mobile wireless network
CN105306305B (en) * 2015-11-12 2019-04-05 中国电子科技集团公司第三十研究所 A kind of mobile wireless network traffic data collection method and device
CN109257218A (en) * 2018-09-19 2019-01-22 上海电子信息职业技术学院 One kind being based on snmp protocol network system isolated island self-healing method
CN110908947A (en) * 2019-11-26 2020-03-24 杭州迪普科技股份有限公司 Hot plug method and device for frame type equipment line card, main control board and frame type equipment

Also Published As

Publication number Publication date
CN103326897B (en) 2016-12-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161228