JPH06175868A - Duplex computer fault monitoring method

Info

Publication number: JPH06175868A
Application number: JP4325321A
Authority: JP
Inventors: Junichi Ichikawa; 純一市川
Original assignee: Kawasaki Steel Corp
Current assignee: JFE Steel Corp
Priority date: 1992-12-04
Filing date: 1992-12-04
Publication date: 1994-06-24

Abstract

PURPOSE: To provide a mutual fault monitoring function with a simple algorithm and hardware configuration, without lowering the overall reliability of a duplex computer system. CONSTITUTION: The duplex computer consists of an A-system computer, which has a central processing unit (CPU) 10A and a main storage device 14A connected to the CPU 10A through a local bus 12A, and a B-system computer, which has a CPU 10B and a main storage device 14B connected to the CPU 10B through a local bus 12B. The two computers are connected by line controllers 16A and 16B, attached to the local buses 12A and 12B respectively, and by a mutual monitoring line 24 that links these line controllers. Each computer has a monitoring time limit that runs from the transmission of its own state until the reply is received from the other system, and the operating main system computer is given a shorter monitoring time limit than the standby slave system computer.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a duplex computer fault monitoring method, and in particular to a duplex computer fault monitoring method that can be applied effectively to mutual fault monitoring in a redundant computer system of the duplex type.

[0002]

[Prior Art] A redundant computer of the duplex type consists of a main system computer, which carries out the principal business processing, and a slave system computer, which normally stands by and takes over that processing if the main system computer fails.

[0003] A known fault monitoring method for such a system has the main system computer and the slave system computer send monitoring signals to each other and time the response from the other system, thereby detecting that a fault has occurred in the other system.

[0004] However, with this time-based detection, a failure of the communication means used for mutual monitoring also causes each computer to conclude that the other has failed, so the system may become inoperable.

[0005] To prevent the system from becoming inoperable because of a failure of the mutual monitoring communication means, a technique is known in which a separate communication means for fault location is provided in addition to it. When a fault is detected, this second means is used to determine whether the fault lies in the mutual monitoring communication means or in the computer of one of the systems.

[0006]

[Problems to Be Solved by the Invention] However, providing a fault-location communication means alongside the monitoring one, as described above, means that the communication means used for mutual fault monitoring between the duplexed computers is itself duplicated. This increases the amount of hardware needed for monitoring communication between the main and slave system computers, and it also complicates the mutual fault monitoring algorithm.

[0007] The present invention was made to solve these problems of the prior art. Its object is to provide a duplex computer fault monitoring method that realizes the mutual fault monitoring function of a duplex computer system with a simple algorithm and hardware configuration, without lowering the reliability of the system as a whole.

[0008]

[Means for Solving the Problems] The present invention is a duplex computer fault monitoring method in which an own system computer and an other system computer monitor each other for abnormal states. It applies to a duplex computer that has monitoring communication means by which each computer notifies the other system's computer of its own state, and means for monitoring the time from transmission of the own system's state until a response signal to that transmission is returned from the other system's computer. The invention achieves the above object by setting, for the operating main system computer, a monitoring time limit shorter than that of the standby slave system computer.

[0009]

[Operation] In the present invention, when the two computers time each other's signals over the fault monitoring communication means, the main system computer is given a shorter monitoring time limit than the slave system computer. Consequently, if the fault monitoring communication means fails, the monitoring timeout always occurs first on the main system side.

[0010] By detecting this timeout, the main system computer can tell that a fault has occurred in either the fault monitoring communication means or the slave system computer. It can therefore continue operating while countermeasures are taken. As a result, reliable operation of a standby-type duplex computer system can be guaranteed using a fault monitoring communication means that consists of a single line.

[0011]

[Embodiments] An embodiment of the present invention is described in detail below with reference to the drawings.

[0012] FIG. 1 is a system configuration diagram showing a duplex computer to which an embodiment of the present invention is applied.

[0013] The duplex computer system basically consists of two independent computers, the A system and the B system, and a console terminal and external storage devices shared by them. In the description below, reference numerals for the A-system computer carry the suffix A and those for the B-system computer carry the suffix B.

[0014] In this system, the A-system computer (hereinafter simply "the A system") and the B-system computer (hereinafter simply "the B system") have central processing units (CPU-A, CPU-B) 10A and 10B, respectively. Main storage devices (MEM-A, MEM-B) 14A and 14B are connected to the central processing units 10A and 10B through high-speed local buses 12A and 12B, respectively. Two line control devices (CCA1, CCA2) 16A and 18A are connected to the A-system local bus 12A, and likewise two line control devices (CCB1, CCB2) 16B and 18B are connected to the B-system local bus 12B.

[0015] The line control devices 16A and 16B are connected through a changeover switch (SW1) 20, which can be switched to either of them by input from a console terminal 22 connected to the switch. The operator uses the console terminal 22 to give commands to the system and to view messages from the system.

[0016] The line control devices 18A and 18B are connected by a mutual monitoring line 24, over which the A system and the B system can exchange messages for monitoring each other for faults. This mutual monitoring function is described in detail later.

[0017] Input/output buses 26A and 26B are connected to the central processing units 10A and 10B, respectively, and the various input/output control devices described below are connected to these buses.

[0018] Reference numerals 28A and 28B denote system control devices (CTLA, CTLB), which turn the power of the other system on and off and control the changeover switch (SW1) 20, the changeover switch (SW2) 36 described later, and the like. Reference numerals 30A and 30B denote control devices (HDCA, HDCB) for the SCSI bus, a standard bus for connecting external storage devices such as magnetic disks. The SCSI bus is shared by the control devices 30A and 30B so that it can be accessed from both the A system and the B system, and it is connected to magnetic disks (HDTA, HDTB) 32A and 32B.

[0019] The two magnetic disks 32A and 32B are mirrored so that operation can continue even if one of them fails, and they basically hold the same contents.

[0020] On the magnetic disks 32A and 32B, OS-A-1 and OS-A-2 hold identical copies of the files used by the A-system OS (operating system), and OS-B-1 and OS-B-2 hold identical copies of the files used by the B-system OS. Likewise, DATA-1 and DATA-2 hold identical copies of the data files used by the user on the A system or the B system.

[0021] Line control devices (CCA3, CCB3) 34A and 34B, each controlling a plurality of lines, are connected to the input/output buses 26A and 26B, respectively. A changeover switch (SW2) 36 selects whether the line group from the line control device 34A or the line group from the line control device 34B is output to the outside.

[0022] In normal operation, the line control device on the side of the operating main system computer (hereinafter simply "the main system") is connected to the outside. When the standby slave system computer (hereinafter simply "the slave system") takes over the main system's work because the main system has failed, the changeover switch 36 is switched by a switch control signal from the system control devices 28A and 28B so that the line control device on the slave side is connected to the outside.

[0023] When peripheral devices other than those described above are used, such as a backup MT device or various communication devices, a control device may be connected to each of the A-system and B-system input/output buses 26A and 26B and the peripheral devices connected to it independently for each system. Alternatively, the peripheral devices may be connected through a switching unit.

[0024] Next, the fault monitoring procedure of this embodiment is described in detail with reference to the flowcharts of FIGS. 2 and 3.

[0025] Which of the A system and the B system acts as the main system is normally specified by means such as a flag stored in the auxiliary storage device, set by a selector switch (not shown) or by the user.

[0026] The main system executes the fault management processing according to the flow shown in FIG. 2. First, in step 201, initial settings are made, such as the message transmission interval T and the monitoring time limit TA used for time monitoring. The main business application program may also be started here.

[0027] In the next step 202, the main system sends the slave system, over the mutual monitoring line 24, a normal message indicating that the main system is operating normally.

[0028] Then, in step 203, the main system reads a message from the slave system while timing the read against a limit of TA seconds. The following decision step 204 determines whether a correct message was read from the slave system, or whether no message arrived from the slave system within TA seconds and a timeout occurred.

[0029] If step 204 finds that no timeout occurred, the process proceeds to step 205, waits T seconds, and returns to step 202.

[0030] If, on the other hand, step 204 finds that a timeout occurred, the main system judges that the slave system has failed, cuts off the slave system's power, and continues to run the workload itself.
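The main-system procedure of steps 201 to 205 can be sketched as a loop. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `run_main_system`, the injected callables (`send`, `receive`, `power_off_slave`, `wait`), the `NORMAL` message encoding, and the `max_cycles` guard are all hypothetical.

```python
from typing import Callable, Optional

NORMAL = "NORMAL"  # hypothetical encoding of the "normal" message


def run_main_system(
    send: Callable[[str], None],
    receive: Callable[[float], Optional[str]],
    power_off_slave: Callable[[], None],
    wait: Callable[[float], None],
    t: float = 1.0,
    ta: float = 5.0,
    max_cycles: int = 1000,
) -> str:
    """Main-system monitoring loop following steps 201 to 205.

    receive(timeout) is assumed to return the message text, or None
    when nothing arrived within `timeout` seconds.
    """
    # Step 201: T and TA are initialised (here they arrive as arguments).
    for _ in range(max_cycles):
        send(NORMAL)             # step 202: report that the main system is healthy
        reply = receive(ta)      # step 203: read the reply, limited to TA seconds
        if reply is None:        # step 204: the read timed out
            power_off_slave()    # slave judged faulty: cut its power
            return "slave_powered_off"  # the main system keeps running the workload
        wait(t)                  # step 205: wait T seconds, then repeat from step 202
    return "cycle_limit_reached"
```

Passing the I/O operations in as callables keeps the sketch independent of any particular line controller or power-control hardware, and makes the loop easy to exercise with stubs.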

[0031] The slave system, meanwhile, executes the fault monitoring processing according to the flow shown in FIG. 3. First, as in the main system, settings such as the message transmission interval T and the monitoring time limit TB are made.

[0032] Then, in step 302, the slave system reads a message from the main system while timing the read against a limit of TB seconds. The following decision step 303 determines whether a normal message was read from the main system, or whether no message arrived from the main system within TB seconds and a timeout occurred.

[0033] If step 303 finds that the monitoring in step 302 did not time out, the process proceeds to step 304 and waits T seconds, then proceeds to step 305 and sends a normal message to the main system, and then returns to step 302.

[0034] If, on the other hand, step 303 finds that a timeout occurred, the process proceeds to step 306, where the slave system judges that the main system has failed, cuts off the main system's power, and then starts the business application program on the slave side.

[0035] The business application program started on the slave side takes over and continues the work using the checkpoint data stored in DATA-1 and DATA-2 on the magnetic disks 32A and 32B shown in FIG. 1. At the same time, switching units such as the changeover switches 20 and 36 are switched to the slave side as necessary.
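The slave-system procedure of steps 302 to 306 can be sketched in the same style. Again this is only an illustration under assumptions: `run_slave_system`, the injected callables, the `NORMAL` encoding, and `max_cycles` are hypothetical names, and restoring checkpoint data and operating the changeover switches are left to the `start_application` callable.

```python
from typing import Callable, Optional

NORMAL = "NORMAL"  # hypothetical encoding of the "normal" message


def run_slave_system(
    send: Callable[[str], None],
    receive: Callable[[float], Optional[str]],
    power_off_main: Callable[[], None],
    start_application: Callable[[], None],
    wait: Callable[[float], None],
    t: float = 1.0,
    tb: float = 20.0,
    max_cycles: int = 1000,
) -> str:
    """Slave-system monitoring loop following steps 302 to 306.

    receive(timeout) is assumed to return the message text, or None
    when nothing arrived within `timeout` seconds.
    """
    for _ in range(max_cycles):
        message = receive(tb)    # step 302: read from the main system, limited to TB seconds
        if message is None:      # step 303: the read timed out
            power_off_main()     # step 306: main judged faulty, cut its power first
            start_application()  # then start the workload on the slave side
            return "took_over"
        wait(t)                  # step 304: wait T seconds
        send(NORMAL)             # step 305: reply with a normal message
    return "cycle_limit_reached"
```

The main system's power is cut before the application starts, matching the order given in step 306, so that both systems never run the workload at once.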

[0036] Conventionally, the time limit TA on the main side and the time limit TB on the slave side were set to the same value, one sufficiently longer than the message transmission interval T. This works correctly for hardware faults that prevent one system from operating as a system. However, when the mutual monitoring communication means, namely the mutual monitoring line 24 or the line control devices 18A and 18B, fails, a system may judge the other to have failed and switch over even though no switchover is needed. In such cases, as described earlier, an inquiry was made over a separately provided communication means to determine whether the fault was in a system or in the monitoring communication means. This required a separate communication means for the inquiry and made the algorithm more complicated.

[0037] In this embodiment, the monitoring time limits TA and TB are made sufficiently longer than the message transmission interval T, and TA is made sufficiently shorter than TB; that is, T < TA < TB (for example, T = 1, TA = 5, TB = 20). As a result, when the monitoring communication means fails, the timeout always occurs on the main side, so the main system can cut off the slave system's power and continue its work. Naturally, a fault that makes either system inoperable is also detected by timeout and handled correctly.
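The ordering T < TA < TB could be checked when the limits are configured. The following is a minimal sketch; the helper name `check_limits` is hypothetical, and the source does not describe any such validation step.

```python
def check_limits(t: float, ta: float, tb: float) -> None:
    """Raise ValueError unless 0 < T < TA < TB holds.

    T is the message transmission interval, TA the main system's
    monitoring time limit and TB the slave system's.
    """
    if not (0 < t < ta < tb):
        raise ValueError(
            f"expected 0 < T < TA < TB, got T={t}, TA={ta}, TB={tb}"
        )


# The example values from the embodiment pass the check.
check_limits(1, 5, 20)
```

The check only enforces the strict ordering. The text asks for TA to be "sufficiently" shorter than TB, and how large that margin must be depends on the system's timing, which this sketch does not model.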

[0038] Next, a concrete example of system operation when the monitoring communication line fails is described with reference to FIG. 4.

[0039] In FIG. 4, during normal operation the main system sends a normal message at a1 and the slave system reads it at b1. The slave system waits T seconds at b2 and then sends a normal message to the main system at b3. At b4 the slave system waits to read a message from the main system and at the same time enters the time monitoring state. When the main system receives the normal message from the slave system at a2, it waits T seconds at a3 and then sends a normal message at a4.

[0040] Suppose that the mutual monitoring line fails and communication becomes impossible while the main system is waiting T seconds at a3. The slave system is then in the time monitoring state, waiting for a message from the main system, and at the same time the main system enters the time monitoring state at a5, waiting for a message from the slave system.

[0041] However, because the main system's time limit is sufficiently shorter than the slave system's, the timeout at a6 always occurs on the main side. The main system therefore cuts off the slave system's power at a7 and can continue running the business application.
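The FIG. 4 scenario can be worked through numerically. The sketch below rests on simplifying assumptions that are not stated in the source: message transfer takes no time, so the slave starts its wait (b4) at the same instant the main system receives the reply (a2), and the main system starts its own wait (a5) exactly T seconds later. The function names are hypothetical.

```python
from typing import Tuple


def timeout_instants(
    t: float, ta: float, tb: float, slave_wait_start: float = 0.0
) -> Tuple[float, float]:
    """Return (main_timeout, slave_timeout) for a line failure during a3.

    Assumes zero transfer delay: the slave starts waiting at b4, the main
    system waits T seconds (a3), sends a message that is lost (a4), and
    starts waiting at a5.
    """
    main_wait_start = slave_wait_start + t
    return main_wait_start + ta, slave_wait_start + tb


def first_to_time_out(t: float, ta: float, tb: float) -> str:
    """Name the side whose timer would expire first in this scenario."""
    main_at, slave_at = timeout_instants(t, ta, tb)
    if main_at < slave_at:
        return "main"
    if slave_at < main_at:
        return "slave"
    return "both"


print(timeout_instants(1, 5, 20))   # (6.0, 20.0)
print(first_to_time_out(1, 5, 20))  # main
```

Under these assumptions the main system times out first whenever T + TA < TB, which is one way to read the requirement that TA be "sufficiently" shorter than TB. With equal limits, for example TA = TB = 20, the same sketch reports the slave timing out first, which would trigger a switchover that is not needed.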

[0042] A line failure can occur at any time, not only at the point in the sequence described above, but the system can be made to operate correctly in the same way. A line failure may disable communication in both directions or in only one direction, and the system operates correctly in either case.

[0043] The description so far has covered faults that cause the time monitoring of the other system to time out. If each system can detect its own fault and notify the other system, it can report the abnormality to the other system in the normal message sending procedure of step 202 or step 305, and the system that receives the report performs the same processing as for a timeout. The system as a whole can thus easily be made to operate correctly.
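Treating a reported abnormality the same way as a timeout can be expressed as a single predicate. This is a minimal sketch with hypothetical names (`peer_is_faulty`, `NORMAL`, `ABNORMAL`); the source does not specify a message encoding, nor how any other message content should be handled.

```python
from typing import Optional

NORMAL = "NORMAL"      # hypothetical encoding of the "normal" message
ABNORMAL = "ABNORMAL"  # hypothetical encoding of a self-reported fault


def peer_is_faulty(message: Optional[str]) -> bool:
    """Return True when the other system must be treated as failed.

    None stands for a timeout (nothing received within the time limit).
    A self-reported abnormality is handled exactly like a timeout.
    Handling of any other content is not specified in the source and
    is left as "not faulty" here.
    """
    return message is None or message == ABNORMAL
```

In the loop of either system, this predicate would replace the plain timeout test in step 204 or step 303, so the existing power-off and takeover handling is reused unchanged.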

[0044] In the embodiment described above, when an abnormality in the other system is detected by timeout, it is not possible at that moment to tell whether the other system or the monitoring communication means has failed. In that case, a diagnostic program is first run on the system judged abnormal to diagnose the fault location. If both the main system and the slave system prove normal, a program that diagnoses the monitoring line is then run. The fault location can be determined in this way and, as a result, identified accurately.
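The diagnosis order described here can be sketched as follows. The function `locate_fault` and its callables are hypothetical stand-ins for the diagnostic programs, whose interfaces the source does not describe; each callable is assumed to return True when the component it checks is healthy.

```python
from typing import Callable


def locate_fault(
    suspect: str,
    system_ok: Callable[[str], bool],
    line_ok: Callable[[], bool],
) -> str:
    """Locate a fault after a timeout, in the order given in the text.

    suspect is "main" or "slave": the system judged abnormal.
    """
    other = "slave" if suspect == "main" else "main"
    if not system_ok(suspect):  # first diagnose the system judged abnormal
        return suspect
    if not system_ok(other):    # the line is checked only if both systems are normal
        return other
    if not line_ok():           # then diagnose the monitoring line
        return "monitoring_line"
    return "not_found"
```

For example, if the slave was powered off after a main-side timeout and both systems pass their diagnostics, the function goes on to test the monitoring line and reports it when that test fails.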

[0045] According to this embodiment, therefore, a highly reliable duplex computer system can be built with a single communication line for mutual monitoring, using a simple hardware configuration and processing procedure.

[0046] Although the present invention has been described above in concrete terms, it is not limited to the embodiment shown and can be modified in various ways without departing from its gist.

[0047] For example, the specific system configuration of the duplex computer is not limited to the one shown in the embodiment.

[0048]

[Effects of the Invention] As described above, the present invention realizes the mutual fault monitoring function of a duplex system with a simple algorithm and hardware configuration, without lowering the reliability of the duplex computer system as a whole.

[Brief description of drawings]

FIG. 1 is a system configuration diagram showing a duplex computer to which an embodiment of the present invention is applied.

FIG. 2 is a flowchart for explaining the operation of the embodiment.

FIG. 3 is another flowchart for explaining the operation of the embodiment.

FIG. 4 is an explanatory diagram showing an example of system operation when the monitoring communication line fails.

[Explanation of symbols]

10A, 10B: central processing unit
12A, 12B: local bus
14A, 14B: main storage device
16A, 16B, 18A, 18B: line control device
20: changeover switch
22: console terminal
24: mutual monitoring line
26A, 26B: input/output bus
28A, 28B: system control device
30A, 30B: SCSI control device
32A, 32B: magnetic disk
34A, 34B: line control device
36: changeover switch

Claims

[Claims]

1. A duplex computer fault monitoring method in which an own system computer and an other system computer mutually monitor the occurrence of an abnormal state, in a duplex computer having monitoring communication means for mutually notifying the other system computer of the state of the own system computer, and means for monitoring the time from transmission of the state of the own system computer until a response signal to that transmission is returned from the other system computer, the method being characterized in that a monitoring time limit shorter than that of the standby slave system computer is set for the operating main system computer.