JPH02195420A

JPH02195420A - Error recovery processing system for common magnetic disk device

Info

Publication number: JPH02195420A
Application number: JP1472889A
Authority: JP
Inventors: Masayoshi Tanigawa; 正芳谷川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1989-01-24
Filing date: 1989-01-24
Publication date: 1990-08-02

Abstract

PURPOSE: To prevent a failure in one host from bringing down the entire system, and thereby improve system reliability, by detecting the failed path and failed disk, releasing the failed host's reserve state, and continuing processing. CONSTITUTION: When specific disks 14, 15 remain reserved because system 8 has gone down, every start request from systems 1 to 7 receives a device-busy response. If the systems are in shared operation and no device-free interrupt arrives, the failed path is searched for in a path information management table 18. When the in-use flag 21 is not on, the reserve flag 22 is checked; if flag 22 is on, the failed path 11 and failed disks 14, 15 are detected. When the systems are not in shared operation, the device-busy state can be released by issuing a UR command.

Description

[Detailed Description of the Invention] [Summary] The invention relates to an error recovery processing method for a shared magnetic disk device. Its object is to provide an error recovery processing method that avoids a system-down of the entire system even when a specific host fails, thereby improving system reliability. In a magnetic disk control device that controls a plurality of disks and has a connection path connected to a plurality of hosts and director units that control data transfer, a path information management table having a connected host identification section, a device address, an in-use flag, and a reserve flag is provided between the connection path and the director units. The failed path and failed disk are detected by searching the path information management table, the reserve state of the failed host is released, and processing continues.

[Industrial Application Field] The present invention relates to an error recovery processing method for a shared magnetic disk device.

In large-scale information processing systems, a plurality of magnetic disk devices are shared among a large number of hosts.

In this case, if a failure of a specific host leaves a magnetic disk device reserved indefinitely, the device returns a device-busy response whenever a normal host accesses it, and it becomes permanently unusable.

In such a case, therefore, the reserve state held by the failed host must be released so that the magnetic disk device becomes usable again.

[Prior Art] An example of a conventional magnetic disk device is shown in FIG. 5.

In FIG. 5, 1 is a connection path connected to a plurality of hosts (systems); 2 and 3 are mutually independent director units; and 4 is a disk controlled by the director units 2, 3.

The disk 4 is shared by a large number of hosts. While it is reserved by a specific host, every start request from any other host results in device busy.

[Problems to Be Solved by the Invention] In such a conventional magnetic disk device, however, paths are managed per director unit, and there is no function for passing information between director units or between paths. Consequently, if a specific host fails during shared operation, the disk (device) can remain reserved permanently and other hosts can no longer use it. As shown in FIG. 6, the system falls into an infinite loop, and eventually the entire system goes down.

The present invention has been made in view of these conventional problems. Its object is to provide an error recovery processing method for a shared magnetic disk device that avoids a system-down of the entire system even when a specific host fails, thereby improving system reliability.

[Means for Solving the Problems] FIG. 1 is a diagram explaining the principle of the present invention.

In FIG. 1, 11 is a connection path connected to a plurality of hosts; 12 and 13 are director units that control data transfer; and 18 is a path information management table provided between the connection path 11 and the director units 12, 13, having a connected host identification section, a device address, an in-use flag, and a reserve flag.

[Operation] In the present invention, a path information management table is provided between the connection path and the director units. When a specific host fails during shared operation, the table is searched, and a failed path and failed disk are detected when the in-use flag is off and the reserve flag is on. Issuing a UR command then releases the device-busy state, so that other hosts can start operations again and system processing can continue.

As a result, a failure of a specific host is prevented from spreading to other systems, and system reliability is improved.

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

FIGS. 2 to 4 show one embodiment of the present invention.

In FIG. 2, 11 is a connection path connected to a plurality of hosts (system 1 to system 8); 12 and 13 are director units that control data transfer; and 14 and 15 are disks controlled by the director units 12, 13 via adapters 16, 17.

Here, as shown in FIG. 3, a path information management table 18 is provided between the connection path 11 and the director units 12, 13.

The path information management table 18 has a connected host identification section 19, a device address 20, an in-use flag 21, and a reserve flag 22.
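The four fields of the table can be pictured as one record per host and device pair. The following is a minimal sketch only: the class name `PathEntry`, the field names, the helper `entries_for_device`, and the sample rows are illustrative assumptions, and the disk reference numerals 14 and 15 are reused as stand-in device addresses. The source does not specify field widths or how the table is stored.

```python
from dataclasses import dataclass


@dataclass
class PathEntry:
    """One row of the path information management table (18). Names are hypothetical."""
    host_id: int         # connected host identification section (19)
    device_address: int  # device address (20)
    in_use: bool         # in-use flag (21)
    reserved: bool       # reserve flag (22)


# Hypothetical contents: system 8 holds reserves on disks 14 and 15 and has
# stopped accessing them; system 1 is still active on the path to disk 14.
table = [
    PathEntry(host_id=8, device_address=14, in_use=False, reserved=True),
    PathEntry(host_id=8, device_address=15, in_use=False, reserved=True),
    PathEntry(host_id=1, device_address=14, in_use=True, reserved=False),
]


def entries_for_device(table, device_address):
    """Return every row that refers to the given device address."""
    return [e for e in table if e.device_address == device_address]
```

Because the table sits between the connection path and both director units, a lookup such as `entries_for_device(table, 14)` would see rows for every host, which is the cross-director visibility the conventional device lacked.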

This makes it possible to track and control path usage both within each of the director units 12, 13 and between them. The connection path 11, the path information management table 18, and the director units 12, 13 together constitute the magnetic disk control device.

Next, the operation is described with reference to the flowchart of FIG. 4.

First, assume that system 8 goes down in step S1. If disks 14, 15 are not reserved in step S2, the failure of system 8 has no effect (see step S3). If, however, specific disks 14, 15 were left reserved by system 8 when it went down, every start request from systems 1 to 7 results in device busy (see step S4). This device-busy state is never released unless a system reset is issued from the failed system 8.
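The effect of a stale reserve in steps S2 to S4 can be illustrated with a small model. This is a sketch under assumptions: the function name `start_io` and the string results are hypothetical, and the model ignores everything about a real channel interface except the reserve check.

```python
def start_io(requesting_host, reserved_by):
    """Model the response a disk gives to a start request.

    reserved_by is the host currently holding the reserve, or None.
    """
    if reserved_by is not None and reserved_by != requesting_host:
        return "DEVICE_BUSY"  # step S4: another host holds the reserve
    return "ACCEPTED"         # step S3: no reserve, or the requester holds it


# System 8 went down while holding the reserve, so systems 1 to 7 are refused.
responses = [start_io(host, reserved_by=8) for host in range(1, 8)]
```

In this model every element of `responses` is `"DEVICE_BUSY"`, and nothing changes that outcome until the reserve held by system 8 is cleared.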

Next, step S5 determines whether the system is in shared operation. If it is not, issuing a UR (Unconditional Reserve) command in step S6 releases the device-busy state; in step S7, start requests from systems 1 to 7 become possible again and system processing can continue.

If the system is in shared operation, step S8 determines whether a device-free interrupt is possible. If it is, the process proceeds to step S7 and system processing continues. If it is not, step S9 monitors for a period of, for example, 3 minutes. If no device-free interrupt occurs even after this monitoring period, step S10 searches the path information management table 18 for the failed path.
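The time monitoring of steps S8 and S9 amounts to waiting for the device-free condition with a deadline. The sketch below makes several assumptions: it polls a caller-supplied `device_free` callable instead of receiving a hardware interrupt, the function name and parameters are hypothetical, and the 180-second default simply mirrors the 3-minute example in the text. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time


def wait_for_device_free(device_free, timeout_s=180.0, poll_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Return True if device_free() becomes true within timeout_s seconds.

    Returning False corresponds to falling through to the table search (S10).
    """
    deadline = clock() + timeout_s
    while True:
        if device_free():
            return True       # step S8 satisfied: continue at step S7
        if clock() >= deadline:
            return False      # step S9 expired: go on to step S10
        sleep(poll_s)
```

A caller would treat a `False` result as the trigger for searching the path information management table, as the text describes.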

Next, it is determined whether the in-use flag 21 is on. If it is on, the process proceeds to step S12, treats the system as running, and returns to step S10.

If the in-use flag 21 is not on, step S13 determines whether the reserve flag 22 is on. If it is not on, the process proceeds to step S14, treats the entry as having no system operation, and returns to step S10. If the reserve flag 22 is on, step S15 detects the failed path 11 and the failed disks 14, 15.

That is, the in-use flag 21 is on only while the path is in use by the host and an access has occurred within a fixed period (for example, 10 minutes); it turns off when accesses stop, for instance because the system has gone down. Therefore, as described above, when the in-use flag 21 is off and the reserve flag 22 is on, it is determined that a system-down has produced the failed path 11 and the failed disks 14, 15.
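The two flag tests reduce to a three-way classification of each table entry. The following is a minimal sketch; the enum `PathState`, its member names, and the function `classify` are hypothetical labels chosen to match the outcomes of steps S12, S14, and S15.

```python
from enum import Enum


class PathState(Enum):
    SYSTEM_RUNNING = "system running"            # step S12
    NO_SYSTEM_OPERATION = "no system operation"  # step S14
    FAILED = "failed path / failed disk"         # step S15


def classify(in_use: bool, reserved: bool) -> PathState:
    """Apply the in-use test first, then the reserve test (step S13)."""
    if in_use:
        return PathState.SYSTEM_RUNNING
    if reserved:
        return PathState.FAILED
    return PathState.NO_SYSTEM_OPERATION
```

Only the combination of in-use off and reserve on yields `PathState.FAILED`; a reserve held by a host that is still accessing the disk is left alone.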

When the failed path 11 and failed disks 14, 15 have been detected, the situation is equivalent to the system not being in shared operation. The process therefore proceeds to step S6 and issues the UR command to release the device-busy state; in step S7, start requests from systems 1 to 7 become possible, and system processing continues.

This search for the failed path 11 and the recovery processing are executed by whichever of the normal systems 1 to 7 first detects the error.
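Taken together, the search and recovery performed by the detecting system can be sketched as a scan followed by a release. Everything named here is an assumption made for illustration: `PathEntry`, `find_failed`, `recover`, and the `issue_ur` callback standing in for the UR command. Clearing the reserve flag after the command is a modeling choice, not something the source spells out.

```python
from dataclasses import dataclass


@dataclass
class PathEntry:
    host_id: int
    device_address: int
    in_use: bool
    reserved: bool


def find_failed(table):
    """Steps S10 to S15: entries whose in-use flag is off and reserve flag is on."""
    return [e for e in table if not e.in_use and e.reserved]


def recover(table, issue_ur):
    """Release every stale reserve and return the released device addresses."""
    released = []
    for entry in find_failed(table):
        issue_ur(entry.device_address)  # step S6, via a hypothetical callback
        entry.reserved = False          # model the cleared reserve state
        released.append(entry.device_address)
    return released                     # step S7: these devices accept starts again


# Hypothetical table: system 8 is down and still holds reserves on 14 and 15.
table = [
    PathEntry(host_id=8, device_address=14, in_use=False, reserved=True),
    PathEntry(host_id=8, device_address=15, in_use=False, reserved=True),
    PathEntry(host_id=1, device_address=14, in_use=True, reserved=False),
]
```

Calling `recover(table, issue_ur)` a second time finds nothing left to release, so it would be harmless if more than one healthy system attempted the recovery in this model.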

[Effects of the Invention] As described above, according to the present invention, a failed path and failed disk are detected when a specific host fails during shared operation of the system. This prevents the failure from spreading to other systems and allows system processing to continue. As a result, system reliability can be improved.

[Brief Description of the Drawings]

FIG. 1 is a diagram explaining the principle of the present invention; FIG. 2 is a block diagram showing an embodiment of the present invention; FIG. 3 is a diagram showing the path information management table; FIG. 4 is a flowchart for explaining the operation; FIG. 5 is a block diagram showing a conventional example; and FIG. 6 is a flowchart of the conventional example. In the figures: 11, connection path; 12, 13, director units; 14, 15, disks; 16, 17, adapters; 18, path information management table; 19, connected host identification section; 20, device address; 21, in-use flag; 22, reserve flag.

Claims

In a magnetic disk control device that controls a plurality of disks (14) and has a connection path (11) connected to a plurality of hosts and director units (12), (13) that control data transfer, an error recovery processing method for a shared magnetic disk device, characterized in that a path information management table (18) having a connected host identification section, a device address, an in-use flag, and a reserve flag is provided between the connection path (11) and the director units (12), (13); a failed path and a failed disk are detected by searching the path information management table (18); the reserve state of the failed host is released; and processing is continued.