JP3325785B2

JP3325785B2 - Computer failure detection and recovery method

Info

Publication number: JP3325785B2
Application number: JP28477796A
Authority: JP
Inventors: 俊之木村
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1996-10-28
Filing date: 1996-10-28
Publication date: 2002-09-17
Anticipated expiration: 2016-10-28
Also published as: JPH10133963A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a computer failure detection and recovery method, and more particularly to a method for detecting, with high reliability, a failure that occurs in a network-connected computer.

[0002]

[Prior Art] Depending on the nature of the processing, there are cases where a temporary system outage can be tolerated but the system must be rebooted immediately so that it continues operating. Causes of system outages include not only physical device failures but also, in no small number of cases, software faults and the like from which the system can be restored immediately by rebooting.

[0003] In a system comprising a plurality of computers operating on the same network, for example a plurality of servers each built on a personal computer (PC), each server is given a function for detecting its own failures and those of the other servers, so that software faults (failures) occurring during operation can be detected in real time and a server that has failed and stopped can be restarted immediately.

[0004] FIG. 7 shows a conventional failure detection method for PC servers. Each PC server 100, 200 on the network is fitted with a dedicated failure detection board to implement the failure detection function, and the boards are connected to each other by a dedicated line 150. When a dedicated failure detection board 140, 240 detects that the PC server 100, 200 in which it is installed has failed, it notifies the other dedicated failure detection board 140, 240 of this via the dedicated line 150. The dedicated failure detection boards 140, 240 also monitor the operation of the other PC server 200, 100 by communicating periodically with the other dedicated failure detection board 240, 140, and can detect a failure of the other PC server 200, 100 either by receiving the above notification from the other board 240, 140 or by detecting that communication with the other board 240, 140 has become impossible. When a dedicated failure detection board 140, 240 detects a failure of the PC server 100, 200 in which it is installed, it issues a reboot instruction to restart that PC server.

[0005] Accordingly, in a system operated with, for example, an active PC server and a standby PC server, when the active server goes down the standby server can be informed of that fact, so operation can be switched over automatically to the standby server in real time while the downed active server is rebooted immediately. Likewise, if the standby server fails, it can be rebooted to recover from the failure and placed on standby again.

[0006]

[Problems to Be Solved by the Invention] Conventionally, however, the failure detection function has been implemented using only the dedicated board, so if the dedicated board itself fails, failures occurring in the PC server can no longer be detected. A malfunction of the dedicated board may also cause a failure to be reported falsely. Pursuing higher reliability in the dedicated board to eliminate these problems increases its manufacturing cost. It is therefore desirable to improve the reliability of the failure detection function while keeping the increase in cost as small as possible.

[0007] The present invention has been made to solve the above problems, and its object is to provide a computer failure detection and recovery method that automatically recovers a computer determined to have failed by causing it to restart.

[0008]

[Means for Solving the Problems] To achieve the above object, the computer failure detection and recovery method according to the present invention is for a system in which, of at least three network-connected computers, a plurality are failure detection target computers and at least one is a monitoring computer that detects failures occurring in the failure detection target computers. Each failure detection target computer has: self-failure detection means that can operate independently and detects a failure of the failure detection target computer in which it is installed; network communication means for communicating with the other computers via the network; and restart means for restarting the computer itself when instructed. The monitoring computer has: failure detection means that can operate independently and detects a failure of another failure detection target computer connected via a dedicated line; network communication means for communicating with the other computers via the network; and operating state monitoring control means that monitors the occurrence of failures in the system and performs restart processing for a failure detection target computer in which a failure has been detected. When the operating state monitoring control means detects a failure via the failure detection means, it attempts to communicate over the network with the failure detection target computer in which the failure was detected. If that computer does not respond, the operating state monitoring control means issues a restart instruction for it via the dedicated line. If the computer does respond, the operating state monitoring control means determines that the self-failure detection means installed in that computer has failed and issues a restart instruction for the computer via the network.

[0009] The invention is further characterized in that the failure detection target computer is also the monitoring computer.

[0010] Further, in a system in which, of at least three network-connected computers, a plurality are failure detection target computers, each failure detection target computer has: failure detection means that can operate independently and detects a failure of the failure detection target computer in which it is installed as well as a failure of another failure detection target computer connected via a dedicated line; network communication means for communicating with the other computers via the network; restart means for restarting the computer itself when instructed; and operating state monitoring control means that monitors the occurrence of failures in the system and performs restart processing for a failure detection target computer in which a failure has been detected. When the operating state monitoring control means of a failure detection target computer detects a failure via the network, it attempts to communicate over the network with another computer that has not failed. If that computer responds, the operating state monitoring control means issues, via the dedicated line, a restart instruction for the failure detection target computer in which the failure was detected. If it does not respond, the operating state monitoring control means determines that the network communication means of its own computer has failed and restarts its own computer.

[0011] According to the above invention, the failure detection function is realized by making effective use of the existing network in addition to the failure detection means. In other words, the path used to detect failures is made redundant. In the present invention, however, the redundancy is not a simple duplication of components; it is achieved through a different path that makes effective use of the existing configuration. This improves the reliability of the failure detection function while suppressing any increase in cost.

[0012] The invention is further characterized by having operating state holding means that holds the operating states of the failure detection target computers.

[0013] Furthermore, the operating state monitoring control means installed in the failure detection target computer that has detected a failure controls the execution of the restart instruction according to the operating state held by the operating state holding means. That is, by referring to the operating state holding means it can determine whether a restart instruction has already been issued, so redundant restart instructions need not be issued.

[0014]

[Embodiments of the Invention] Preferred embodiments of the present invention are described below with reference to the drawings.

[0015] Embodiment 1. FIG. 1 is an overall configuration diagram of a network system that is a first embodiment of the computer failure detection and recovery method according to the present invention. The network system consists of two PC servers 10, 20 that are the targets of failure detection, a PC 30 that corresponds to a management device for the network system, and a LAN 2 that connects them. The PC servers 10, 20 and the PC 30 run at all times. LAN boards 12, 22, 32 are installed in the PC servers 10, 20 and the PC 30 respectively, allowing each to communicate with the other computers. The PC servers 10, 20 are also fitted with dedicated failure detection boards 14, 24, respectively, to implement the failure detection function, and these boards 14, 24 are connected by dedicated cables 15, 25 via a configuration control device 4. The dedicated failure detection boards 14, 24 can operate independently even when the PC server 10, 20 in which they are installed has stopped, and can issue a restart (reboot) instruction.

[0016] In each PC server 10, 20, a monitoring program that monitors the operating state of the PC servers 10, 20 is resident in memory and runs at all times; together with the CPU it constitutes an operating state monitoring unit 16, 26. The operating state monitoring units 16, 26 periodically reset a specific flag (not shown) held by the dedicated failure detection boards 14, 24. The operating state monitoring units 16, 26 also communicate periodically via the LAN 2 to check each other on whether the other PC server 20, 10 is running. In addition, each unit can restart its own computer and can instruct the other PC server 10, 20 to restart.

[0017] The configuration control device 4 has an operating state table that holds the operating state of every failure detection target (in this example, the PC servers 10, 20), and keeps track of the operating state of each PC server 10, 20 by updating the table's entries. The operating state is one of "normal", indicating that the server is operating normally; "failed", indicating that the server has failed and is not operating; and "restarting", indicating that recovery is in progress after a failure.
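The operating state table can be pictured as a small state machine. The following Python sketch is illustrative only: the class and method names (`State`, `StateTable`, `set`) are hypothetical, and rejecting transitions other than those mentioned in the description of this embodiment is an assumption, not something the specification requires.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    FAILED = "failed"
    RESTARTING = "restarting"


# Transitions that appear in the description of Embodiment 1.
# Treating every other transition as an error is an assumption of this sketch.
ALLOWED = {
    (State.NORMAL, State.FAILED),
    (State.FAILED, State.RESTARTING),
    (State.NORMAL, State.RESTARTING),
    (State.RESTARTING, State.NORMAL),
}


class StateTable:
    """Hypothetical stand-in for the table held by the configuration control device."""

    def __init__(self, servers):
        # Every registered server starts out as "normal".
        self._states = {name: State.NORMAL for name in servers}

    def get(self, name):
        return self._states[name]

    def set(self, name, new_state):
        old = self._states[name]
        if (old, new_state) not in ALLOWED:
            raise ValueError(f"unexpected transition {old.value} -> {new_state.value}")
        self._states[name] = new_state
```

A table created with `StateTable(["pc10", "pc20"])` would then track the two PC servers of this example.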

[0018] The characteristic feature of this embodiment is that failures of the PC servers 10, 20 can be detected not only by the dedicated failure detection boards 14, 24 but also by using an existing component, namely the LAN 2. In other words, the path used for failure detection is made redundant. In this embodiment, however, the redundancy is not a simple duplication of components; it is achieved through a different path that makes effective use of the existing configuration, which improves the reliability of the failure detection function while suppressing any increase in cost. In this embodiment, a "failure" means an abnormality of a degree that can be recovered from by rebooting the PC server 10, 20.

[0019] The operation of this embodiment is described next. First, the operation in which the dedicated failure detection boards 14, 24 are used to detect a failure of the other PC server 20, 10 and the failed PC server is then restarted is described with reference to the flowchart shown in FIG. 2. In this description, the PC server 10 is assumed to fail.

[0020] The specific flag inside the dedicated failure detection board 14 is reset periodically by the monitoring program. The board continuously monitors this reset operation, and when it detects that the flag has not been reset for a certain period of time or longer (step 101), it determines that the PC server 10 has failed. That is, it concludes that the PC server 10 has failed because the flag reset that the memory-resident monitoring program performs periodically is no longer taking place. Having made this determination, the dedicated failure detection board 14 notifies the configuration control device 4 via the dedicated cable 15 (step 102).
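The flag-reset check of steps 101 and 102 behaves like a watchdog timer. The sketch below is a minimal Python model under stated assumptions: the class `WatchdogBoard`, the timeout value, the `notify` callback, and the choice to report a given failure only once are all hypothetical and are not taken from the specification. Time is passed in explicitly so the logic can be followed without a real clock.

```python
class WatchdogBoard:
    """Hypothetical model of the dedicated failure detection board's timer."""

    def __init__(self, timeout_s, notify):
        self.timeout_s = timeout_s  # assumed "certain period of time"
        self.notify = notify        # stands in for the dedicated cable
        self.last_reset = 0.0
        self.reported = False

    def reset_flag(self, now):
        # Called periodically by the host's monitoring program.
        self.last_reset = now
        self.reported = False

    def poll(self, now):
        # Step 101: no reset for timeout_s or longer means the host has failed.
        if not self.reported and now - self.last_reset >= self.timeout_s:
            self.reported = True
            self.notify("host failed")  # step 102
            return True
        return False
```

Because the board runs independently of the host, `poll` would in practice be driven by the board's own timer rather than by the monitored PC server.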

[0021] On receiving this notification from the dedicated failure detection board 14 (step 121), the configuration control device 4 changes the operating state of the PC server 10 in the operating state table from "normal" to "failed" (step 122). It then refers to the table and sends a message to the dedicated failure detection board 24 of the PC server 20, whose operating state is "normal", reporting that the PC server 10 has failed (step 123). In this example the system consists of two PC servers, so the PC server 20 is identified uniquely. If the system has three or more PC servers and several of them are in the "normal" state, a single PC server is chosen according to a selection criterion, such as registration order in the operating state table or a priority ranking.
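The selection of one "normal" server in step 123 could look like the following Python sketch. The function name `pick_verifier`, the dictionary layout, and the convention that a lower number means higher priority are assumptions made for illustration; the specification only names registration order and priority as possible criteria.

```python
def pick_verifier(table, failed, priority=None):
    """Choose one server in the "normal" state to receive the failure report.

    table    -- dict mapping server name to state, in registration order
    failed   -- name of the server reported as failed
    priority -- optional dict of name -> rank (lower rank wins; assumed convention)
    """
    candidates = [n for n, s in table.items() if s == "normal" and n != failed]
    if not candidates:
        return None
    if priority is not None:
        # Servers without an explicit rank are considered last.
        return min(candidates, key=lambda n: priority.get(n, float("inf")))
    # Python dicts preserve insertion order, so this is registration order.
    return candidates[0]
```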

[0022] When the operating state monitoring unit 26 of the PC server 20 receives, via the dedicated failure detection board 24, the report that the PC server 10 has failed (step 141), it attempts to communicate via the LAN 2 with the PC server 10 reported as failed in order to confirm the validity of the report (step 142). If the PC server 10 responds and the communication completes normally (step 143), the operating state monitoring unit 26 determines that what has failed is not the PC server 10 itself but the dedicated failure detection board 14. In that case, the operating state monitoring unit 26 sends a restart instruction to the PC server 10 via the LAN 2 in order to recover the dedicated failure detection board 14 from its failure (step 144).

[0023] If, on the other hand, the PC server 10 does not respond to the communication attempted by the operating state monitoring unit 26, the unit determines that the PC server 10 is not operating and ends the validity check of the report. The operating state monitoring unit 26 then issues an instruction to reset the PC server 10 to the dedicated failure detection board 14 via the dedicated failure detection board 24 (step 145).
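The branch taken in steps 142 to 145 can be summarized in a few lines of Python. This is a sketch only: the three callables are hypothetical placeholders for the LAN communication attempt, the restart instruction sent over the LAN, and the reset instruction sent over the dedicated line, and the returned strings are labels invented for this example.

```python
def handle_board_report(target, ping_over_lan, restart_over_lan,
                        reset_over_dedicated_line):
    """Cross-check a board-reported failure over the LAN, then recover."""
    if ping_over_lan(target):          # steps 142-143: the server answered
        # The server is alive, so the reporting board is presumed faulty;
        # restarting the server over the LAN also recovers the board (step 144).
        restart_over_lan(target)
        return "board_failed"
    # No answer: the server really is down, so reset it through the
    # independently operating board on the dedicated line (step 145).
    reset_over_dedicated_line(target)
    return "server_failed"
```

The point of the design is that the recovery command always travels over the path that is still believed to work.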

[0024] When the PC server 10 receives the restart instruction from the PC server 20 via either the LAN 2 or the dedicated failure detection board 14, it reboots (step 103).

[0025] Once the operating state monitoring unit 26 has issued the restart instruction to the PC server 10 by either of the above paths, it notifies the configuration control device 4 that the PC server 10 is restarting (step 146). On receiving this notification from the dedicated failure detection board 24, the configuration control device 4 changes the operating state of the PC server 10 in the operating state table from "failed" to "restarting" (step 124).

[0026] When the restart processing completes in the PC server 10, the PC server 10 itself or the dedicated failure detection board 14 returns to its normal state. The operating state monitoring unit 16 then notifies the configuration control device 4, via the dedicated failure detection board 14, that the restart has completed (step 104). On receiving this notification from the dedicated failure detection board 14, the configuration control device 4 changes the operating state of the PC server 10 in the operating state table from "restarting" to "normal" (step 125).

[0027] In this way, the validity of a failure detected by the dedicated failure detection board 14 can be confirmed using the LAN 2, and the failed PC server 10 or dedicated failure detection board 14 can be recovered automatically by having the PC server 10 restart.

[0028] Next, the operation in which a failure of the PC servers 10, 20 is detected via the LAN 2 and the failed PC server is then restarted is described with reference to the flowchart shown in FIG. 3. Here, the PC server 10 is assumed to have failed.

[0029] The monitoring programs of the PC servers 10, 20 communicate periodically via the LAN 2 to check whether each other is running. If the operating state monitoring unit 26 detects an abnormality in communication with the PC server 10, it presumes that the PC server 10 is not operating and has failed (step 241), and notifies the configuration control device 4 of this via the dedicated failure detection board 24 (step 242).

[0030] On receiving this notification (step 221), the configuration control device 4 refers to the operating state of the PC server 10. If the state is "failed" or "restarting", a notification that the PC server 10 has failed has already been received from the PC server 10, so the PC server 10 has in fact failed. If the operating state of the PC server 10 is "normal", it is possible that no such notification has yet been received from the PC server 10 because, for example, the dedicated failure detection board 14 has failed. The configuration control device 4 therefore notifies the PC server 20 which of these is the case (step 222).

[0031] If the report from the configuration control device 4 indicates that the failure of the PC server 10 is already recognized (step 244), the PC server 20 simply ends its processing. This is because the PC server 10 has presumably rebooted itself and is restarting, so no processing for restarting the PC server 10 is needed; and, if the system has three or more PC servers, another PC server may already be carrying out the restart processing. If, on the other hand, the operating state of the PC server 10 is "normal", the PC server 20 attempts to communicate with another computer connected to the LAN 2, the PC 30 (steps 244, 245). If the PC 30 responds, the PC server 20 determines at this point that the PC server 10, from which no response was obtained in the earlier communication, has failed. If the PC 30 does not respond, communication with both the PC server 10 and the PC 30 has failed, so the PC server 20 determines that it is not the PC server 10 that has failed but the LAN board 22 installed in the PC server 20 itself.

[0032] Accordingly, if the PC 30 responds, the operating state monitoring unit 26 reports the detected failure of the PC server 10 again via the dedicated failure detection board 24 and issues an instruction to restart the PC server 10 (steps 246, 247). If the PC 30 does not respond, it reports that its own PC server 20 has failed and reboots by resetting itself (steps 246, 248, 249).
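The decision made by the PC server 20 after a LAN communication failure (FIG. 3) can be sketched as follows. The function name, the string labels it returns, and the `ping_third_party` callable (standing in for the communication attempt with the PC 30) are hypothetical and used here only for illustration.

```python
def handle_lan_timeout(peer_state, ping_third_party):
    """Decide what to do after the peer stopped answering over the LAN.

    peer_state       -- state of the peer as reported from the operating state table
    ping_third_party -- callable returning True if another computer on the
                        LAN (the PC 30 in this example) responds
    """
    if peer_state in ("failed", "restarting"):
        # The failure is already known; recovery is under way elsewhere.
        return "done"
    if ping_third_party():
        # Our own LAN path works, so the silent peer is the one that failed.
        return "restart_peer"
    # Nobody answers: suspect our own LAN board and restart ourselves.
    return "restart_self"
```

Using a third computer as a reference is what lets a server distinguish a failed peer from a failure of its own network interface.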

[0033] If the notification from the PC server 20 is an instruction to restart the PC server 10 (step 223), the configuration control device 4 changes the operating state of the PC server 10 in the operating state table from "normal" to "restarting" and issues an instruction to restart the PC server 10 to the dedicated failure detection board 14 of the PC server 10 (step 224). The PC server 10 reboots in response to the instruction from the dedicated failure detection board 14, and when the reboot completes, the operating state monitoring unit 16 notifies the configuration control device 4, via the dedicated failure detection board 14, that the restart has completed (steps 201, 202). On receiving this notification from the dedicated failure detection board 14, the configuration control device 4 changes the operating state of the PC server 10 in the operating state table from "restarting" to "normal" (step 225).

[0034] If, on the other hand, the notification from the PC server 20 reports that the PC server 20 itself has failed, the configuration control device 4 changes the operating state of the PC server 20 in the operating state table from "normal" to "restarting" (step 226). When it then receives the report that the PC server 20 has finished rebooting (step 250), it changes the operating state of the PC server 20 from "restarting" to "normal" (step 227). Instead of resetting itself, the PC server 20 may be restarted, as in the case of the PC server 10, by receiving a restart instruction via the dedicated failure detection board 24 after the configuration control device 4 has performed step 226.

[0035] In this way, the validity of a failure detected using the LAN 2 can be confirmed by attempting communication with another computer connected to the LAN 2, the PC 30, and the failed PC server 10 or PC server 20 can be recovered automatically.

[0036] As described above, according to this embodiment the failure detection function is implemented using not only the dedicated failure detection boards 14, 24 but also the existing LAN 2, so the failure detection function operates dependably and its reliability is improved. A failure of the means used for failure detection can itself also be identified. Moreover, the reliability of the failure detection function is improved while any increase in cost is suppressed.

[0037] Only one of the processes of FIG. 2 and FIG. 3 described above may be run, but running both concurrently, as in this embodiment, further improves reliability.

[0038] When a failure of the PC servers 10, 20 is detected via the LAN 2 as described with reference to FIG. 3, the PC 30 to which the communication attempt is made via the LAN 2 is, as is clear from the above description, used only as a counterpart for the communication attempt; whether or not it has a failure detection function is irrelevant. There is therefore no problem in giving the PC 30 a failure detection function and making it equivalent to the PC servers 10, 20. In that case, however, the PC 30 would also be fitted with a dedicated failure detection board and connected to the configuration control device 4.

[0039] Embodiment 2. In Embodiment 1 above, two PC servers were connected to the LAN 2 as failure detection target computers, and each was also given the function of a monitoring computer that detects and recovers failures of the other PC server. That is, the PC servers 10, 20 all had equivalent functions. The embodiments that follow illustrate variations in the number of connected computers and the allocation of functions.

【００４０】例えば、図４に示したように１台の故障検
知対象計算機としてのＰＣサーバ４０に対してＰＣサー
バ４０の故障検出・回復を行うための監視計算機として
複数のＰＣサーバ４２−１，４２ー２，…，４２−ｎを
監視計算機群として設ける。すなわち、この場合のＰＣ
サーバ４０は、故障が検出されるだけであって他のＰＣ
サーバの故障を検出する機能は不要である。一方、ＰＣ
サーバ４２は、故障しても問題のない計算機である。こ
のような構成により、いずれかのＰＣサーバ４２が故障
したとしても、故障したＰＣサーバ４０を確実に検出
し、回復させることができる。但し、故障したＰＣサー
バ４０に対して再起動をさせるためには、１台の監視計
算機が動作すればよいので、これは、ＰＣサーバ４２の
間で優先順位を付けて重複動作を防止するようにする必
要がある。若しくは、稼働状態テーブルのＰＣサーバ４
０に対応した稼働状態を、“正常”から“故障中”に変
更させたもののみが再起動処理を行うようにしてもよ
い。For example, as shown in FIG. 4, a plurality of PC servers 42-1, 42-2, ..., 42-n are provided as a group of monitoring computers that detect and recover from failures of a single PC server 40, which is the failure detection target computer. In this case, the PC server 40 only has its failures detected; it needs no function for detecting failures of other PC servers. The PC servers 42, on the other hand, are computers whose own failure causes no problem. With this configuration, even if one of the PC servers 42 fails, a failure of the PC server 40 can still be reliably detected and recovered from. However, only one monitoring computer needs to act in order to restart the failed PC server 40, so priorities must be assigned among the PC servers 42 to prevent duplicate operation. Alternatively, only the PC server 42 that changed the operating state corresponding to the PC server 40 in the operating state table from "normal" to "failed" may perform the restart processing.
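The second option, in which only the monitor that changes the table entry performs the restart, amounts to a test-and-set on the operating state. The following is a minimal single-process sketch, assuming the table lives in one place (as with the configuration control device) and that updates are serialized; the class `OperatingStateTable`, its methods, and the state strings are hypothetical names chosen for illustration.

```python
import threading


class OperatingStateTable:
    """Holds one operating state per failure detection target computer."""

    def __init__(self, servers):
        self._lock = threading.Lock()
        self._state = {name: "normal" for name in servers}

    def claim_restart(self, server: str) -> bool:
        """Flip 'normal' to 'failed'; return True only for the first caller."""
        with self._lock:
            if self._state[server] != "normal":
                return False  # another monitor already claimed the restart
            self._state[server] = "failed"
            return True

    def mark_recovered(self, server: str) -> None:
        """Record that the restarted server is running normally again."""
        with self._lock:
            self._state[server] = "normal"
```

A monitoring server would call `claim_restart` when it detects the failure and issue the restart instruction only if the call returns `True`, so several monitors detecting the same failure do not restart the PC server 40 more than once.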

【００４１】実施の形態３．図５に示したように、１台
のＰＣサーバ５２を専用の監視計算機として設け、故障
検知対象計算機として設けられた他のＰＣサーバ５０−
１，５０ー２，…，５０−ｎの故障検出・回復を一括し
て行うようにしてもよい。なお、この場合は、稼働状態
保持手段として設けられた構成制御装置を独自に設けず
ＰＣサーバ５２に内蔵することができる。 Embodiment 3. As shown in FIG. 5, a single PC server 52 may be provided as a dedicated monitoring computer that collectively handles failure detection and recovery for the other PC servers 50-1, 50-2, ..., 50-n provided as failure detection target computers. In this case, the configuration control device provided as the operating state holding means need not be a separate unit and can be built into the PC server 52.

【００４２】実施の形態４．図６に示したように、故障
検知対象計算機として複数のＰＣサーバ６０−１，６０
ー２，…，６０−ｎと、監視計算機として複数のＰＣサ
ーバ６２−１，６２ー２，…，６２−ｎを設けてもよ
い。 Embodiment 4. As shown in FIG. 6, a plurality of PC servers 60-1, 60-2, ..., 60-n may be provided as failure detection target computers, together with a plurality of PC servers 62-1, 62-2, ..., 62-n as monitoring computers.

【００４３】[0043]

【発明の効果】本発明によれば、故障検知手段のみなら
ず既存のネットワークを有効に利用した障害検出をする
ことができるので、コストの増大を抑止しつつ故障検出
機能の信頼度を向上させることが可能となる。すなわ
ち、故障を検出するための手段を二系統有することによ
って故障を検出するために用いる手段に故障が発生した
場合でも故障の検出を確実にできることのみならず、そ
の手段自身の故障であることをも認識することができ
る。また、故障検知対象計算機を再起動させるための経
路も二系統有することになるので、故障が発生した故障
検知対象計算機を確実にリブートさせ回復させることが
可能となる。[Effects of the Invention] According to the present invention, failures can be detected not only by the failure detection means but also by making effective use of the existing network, so the reliability of the failure detection function is improved while any increase in cost is suppressed. That is, because there are two independent means of detecting a failure, a failure can still be reliably detected even when one of the means used for detection has itself failed, and the fact that the detection means itself has failed can also be recognized. In addition, because there are also two paths for restarting a failure detection target computer, a failed failure detection target computer can be reliably rebooted and recovered.
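The way the two detection systems cross-check each other can be sketched as a small decision function. This is a hedged illustration of the behavior recited in claim 1, not the actual implementation: it assumes the dedicated failure detection board has just reported a failure of a target computer, and the names `Diagnosis`, `decide_after_board_alarm`, and `target_answers_on_lan` are hypothetical.

```python
from typing import NamedTuple


class Diagnosis(NamedTuple):
    faulty_part: str   # what is judged to have failed
    restart_path: str  # path used to send the restart instruction


def decide_after_board_alarm(target_answers_on_lan: bool) -> Diagnosis:
    """Cross-check a failure reported by the board against the LAN."""
    if target_answers_on_lan:
        # The target still answers over the network, so the report is
        # attributed to its self-failure detection means; restart via LAN.
        return Diagnosis("self-failure detection means", "network")
    # No answer on the LAN either: the target computer itself has failed,
    # so the restart instruction goes over the dedicated line.
    return Diagnosis("target computer", "dedicated line")
```

Each branch uses a different restart path, which is why having two paths lets a restart instruction reach the target even when one of the two systems is the part that failed.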

【００４４】また、稼働状態保持手段を設けたので、故
障検知対象計算機の稼働状態を把握することができるた
め、再起動処理を実行させる計算機を特定したり、ある
いは再起動処理を重複して実行させないようにすること
ができる。In addition, because the operating state holding means is provided, the operating state of each failure detection target computer can be tracked, which makes it possible to designate which computer executes the restart processing and to prevent the restart processing from being executed more than once.

[Brief description of the drawings]

【図１】本発明に係る計算機の故障検出・回復方式の
第１の実施の形態であるネットワークシステムの全体構
成図である。FIG. 1 is an overall configuration diagram of a network system according to a first embodiment of a computer failure detection / recovery method according to the present invention.

【図２】第１の実施の形態において故障検知専用ボー
ドにより他のＰＣサーバ故障を検出し、更に故障したＰ
Ｃサーバの再起動をする動作を示したフローチャートで
ある。FIG. 2 is a flowchart showing the operation, in the first embodiment, of detecting a failure of another PC server by means of the dedicated failure detection board and then restarting the failed PC server.

【図３】第１の実施の形態においてＬＡＮ２経由でＰ
Ｃサーバの故障を検出し、更に故障したＰＣサーバの再
起動をする動作を示したフローチャートである。FIG. 3 is a flowchart showing the operation, in the first embodiment, of detecting a failure of a PC server via the LAN 2 and then restarting the failed PC server.

【図４】本発明に係る計算機の故障検出・回復方式の
第２の実施の形態であるネットワークシステムの全体構
成図である。FIG. 4 is an overall configuration diagram of a network system according to a second embodiment of the computer failure detection / recovery method according to the present invention.

【図５】本発明に係る計算機の故障検出・回復方式の
第３の実施の形態であるネットワークシステムの全体構
成図である。FIG. 5 is an overall configuration diagram of a network system according to a third embodiment of the computer failure detection / recovery method according to the present invention.

【図６】本発明に係る計算機の故障検出・回復方式の
第４の実施の形態であるネットワークシステムの全体構
成図である。FIG. 6 is an overall configuration diagram of a network system according to a fourth embodiment of the computer failure detection / recovery method according to the present invention.

【図７】従来のＰＣサーバの故障検出方式を示した図
である。FIG. 7 is a diagram showing a conventional PC server failure detection method.

[Explanation of symbols]

２ＬＡＮ、４構成制御装置、１０，２０，４０，４
２，５０，５２，６０，６２ＰＣサーバ、１２，２
２，３２ＬＡＮボード、１４，２４故障検知専用ボ
ード、１５，２５専用ケーブル、１６，２６稼働状
態監視部、３０ＰＣ。2: LAN; 4: configuration control device; 10, 20, 40, 42, 50, 52, 60, 62: PC servers; 12, 22, 32: LAN boards; 14, 24: dedicated failure detection boards; 15, 25: dedicated cables; 16, 26: operating state monitoring units; 30: PC.

Claims

(57) [Claims]

1. A failure detection / recovery method for a computer, in a system in which a plurality of computers among at least three computers connected to a network are failure detection target computers and at least one of the computers is a monitoring computer that detects a failure occurring in the failure detection target computers, wherein: each failure detection target computer comprises self-failure detection means that is operable independently and detects a failure of the failure detection target computer on which it is mounted, network communication means for communicating with the other computers via the network, and restart means for restarting the computer itself in accordance with an instruction; the monitoring computer comprises failure detection means that is operable independently and detects a failure of a failure detection target computer connected to it via a dedicated line, network communication means for communicating with the other computers via the network, and operating state monitoring control means for monitoring the occurrence of failures and performing restart processing of a failure detection target computer in which a failure is detected; and the operating state monitoring control means, upon detecting a failure via the failure detection means, attempts communication via the network with the failure detection target computer in which the failure was detected, and, if there is no response from that computer, issues a restart instruction to it via the dedicated line through the failure detection means, and, if there is a response, determines that the self-failure detection means mounted on that computer has failed and issues a restart instruction to it via the network.

2. The failure detection / recovery method for a computer according to claim 1, wherein said failure detection target computer is also said monitoring computer.

3. A failure detection / recovery method for a computer, in a system in which a plurality of computers among at least three computers connected to a network are failure detection target computers, wherein each failure detection target computer comprises: failure detection means that is operable independently and detects a failure of the failure detection target computer on which it is mounted and a failure of another failure detection target computer connected via a dedicated line; network communication means for communicating with the other computers via the network; restart means for restarting the computer itself; and operating state monitoring control means for monitoring the occurrence of failures in the system and performing restart processing of a failure detection target computer in which a failure has occurred; and wherein the operating state monitoring control means of a failure detection target computer that has detected a failure via the network attempts communication with another computer, and, if there is a response from that computer, issues, via the dedicated line through the failure detection means, a restart instruction to the failure detection target computer in which the failure was detected, and, if there is no response, determines that the network communication means mounted on its own computer has failed and restarts itself.

4. The failure detection / recovery method for a computer according to claim 1, further comprising operating state holding means that holds the operating state of the failure detection target computer.

5. The failure detection / recovery method for a computer according to claim 4, wherein the operating state monitoring control means mounted on the failure detection target computer that has detected a failure controls execution of the restart instruction in accordance with the operating state held by the operating state holding means.