JP3995992B2

JP3995992B2 - Fault monitoring device

Info

Publication number: JP3995992B2
Application number: JP2002169101A
Authority: JP
Inventors: 隆都筑
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-06-10
Filing date: 2002-06-10
Publication date: 2007-10-24
Anticipated expiration: 2022-06-10
Also published as: JP2004015661A

Description

【０００１】
【発明の属する技術分野】
この発明は、監視対象である各種設備で起こった障害の発生および回復を検出し、通報する障害監視装置に関するものである。
【０００２】
【従来の技術】
例えば光通信システムでは、通信の品質を常時監視し、障害発生時にはシステム管理者が監視のために利用するオペレーション装置に通知することが必要であるが、通信品質が微妙に悪化しているために短時間に障害の発生と回復を連続的に繰り返す状態になることがしばしば発生する。その時、全ての障害発生および回復についてオペレーション装置に通知すると、大量の通知が発行され、システムの負荷が重くなるとともに、システム管理者にとっても大量の通知の中から本当に必要な通知を抽出しなければならないため手間が増す。そのため、大量に発生する障害の発生および回復通知から、必要な通知のみを抽出してオペレーション装置へ送信する方式を採用することが望まれる。従来では、例えば特開平５−１７５９２１号公報に示された障害監視装置のように、障害の発生および復旧を一定時間継続して検知した場合にのみ、オペレーション装置へ通知を行うことにより実現していた。
【０００３】
図１２は、特開平５−１７５９２１号公報に示された、障害監視装置を示す構成図である。図において、１０１は被監視ユニット、１０２は監視ユニット、１０３はソフトウェア、１０４は監視用のオペレーション装置としての報告ターミナル、１０５は継続発生復旧監視回路、１０６は処理部である。被監視ユニット１０１は、通信など装置が本来目的とする処理を行う処理部１０６と、処理部１０６で発生する障害の検出を行う継続発生復旧監視回路１０５を有する。監視ユニット１０２は、継続発生復旧監視回路１０５から通知された障害の発生および復旧を報告ターミナル１０４に通知するソフトウエア１０３を有する。
【０００４】
次に動作について説明する。
継続発生復旧監視回路１０５は、処理部１０６において発生した障害が一定時間以上継続した場合にのみ、障害の発生をソフトウエア１０３へ通知する。監視ユニット１０２は、継続発生復旧監視回路１０５から通知された障害を報告ターミナル１０４へ通知できる形に処理し、通知する。
【０００５】
復旧の場合も同様に、処理部１０６において復旧した障害が一定時間以上継続した場合にのみ継続発生復旧監視回路１０５が障害の復旧を監視ユニット１０２へ通知し、監視ユニット１０２が報告ターミナル１０４へ障害復旧の通知を行う。
【０００６】
図１３は、従来の障害監視装置によるオペレーション装置への障害通知例を示す図である。ここでは、監視周期を１秒、ソフトウェア１０３へ通知を行うために障害の発生および復旧の継続を確認する時間を３秒としたときの、報告ターミナル１０４への通知例を示す。図中、「検出された障害発生／回復」は、処理部１０６で起こった障害の発生および回復を示しており、「連続発生／回復時間」はそれぞれの継続時間を表す。また、「オペレーション装置への通知」は、報告ターミナル１０４に対し、障害の発生および回復が通知されるタイミングを示している。図より、障害の発生は１，３，７，９，１４，１８，２０，２７の各秒に、回復は２，６，８，１０，１７，１９，２４，２８の各秒に起こっているが、３秒以上連続して発生および回復が検知されないと報告ターミナル１０４への通知は行われないので、発生については５，１６の各秒に、回復については１２，２６の各秒のみ通知される。ここで例えば、２７秒に発生した障害は１秒で回復しており継続確認時間である３秒を下回っているので、報告ターミナル１０４への通知は行われない。
【０００７】
【発明が解決しようとする課題】
従来の障害監視装置は以上のように構成されているので、障害の発生も復旧も一定時間継続しない限り、システム管理者が利用する報告ターミナル１０４へ通知されない。そのため、瞬時に発生または回復したために一定時間に達しなかった障害が断続的に発生した場合は、報告ターミナル１０４へ全く障害の通知がされないことになり、システム監視品質の低下を招いてしまうという課題があった。
【０００８】
この発明は上記のような課題を解決するためになされたもので、障害が瞬時に発生および回復を繰り返す場合でも、システム管理者にとって必要な障害は取りこぼさずに通知し、システム監視品質を低下させずに通知量を削減することが出来る障害監視装置を得ることを目的とする。
【０００９】
【課題を解決するための手段】
この発明に係る障害監視装置は、被監視部の障害発生および回復を検出し、検出信号を出力する障害検知部と、検出信号が障害の発生を示しているときは検出信号の確認と同時にオペレーション装置へ障害発生の状態を表す通知情報を出力し、検出信号が障害の回復を示しているときは回復状態が一定時間継続した時点でオペレーション装置へ障害回復の状態を表す通知情報を出力し、また、障害検知部で障害の回復が検出された時刻を障害回復時刻として記憶し、オペレーション装置へ障害回復の状態を表す通知情報を出力する際、障害回復時刻を通知情報に付加する制御部とを備えたものである。
【００１１】
この発明に係る障害監視装置は、被監視部の障害発生および回復を検出し、検出信号を出力する障害検知部と、検出信号を確認すると同時に障害発生および回復の状態を表す通知情報をオペレーション装置へ出力し、また、一定時間内に検出される障害発生と回復の繰り返しの頻度が一定数を超えた場合は障害多発状態と判定し、検出信号が障害の回復を示しているときは回復状態が一定時間継続した時点でオペレーション装置へ障害回復の状態を表す通知情報を出力するように切替える制御部とを備えたものである。
【００１２】
この発明に係る障害監視装置は、被監視部の障害発生および回復を検出し、検出信号を出力する障害検知部と、検出信号を確認すると同時に障害発生および回復の状態を表す通知情報をオペレーション装置へ出力し、また、一定時間内に検出される障害発生と回復の繰り返しの頻度が一定数を超えた場合は障害多発状態と判定し、オペレーション装置へ障害多発状態であることを通知するとともに、オペレーション装置への障害発生および回復の状態を表す通知情報の出力を停止する制御部とを備えたものである。
【００１３】
【発明の実施の形態】
以下、この発明の実施の一形態を説明する。
実施の形態１．
図１はこの発明の実施の形態１による障害監視装置の構成を示すブロック図である。図において、１は監視装置、２は被監視部、３は障害検知部、４は制御部、５は出力部（制御部）、６はオペレーション装置である。監視装置１は、例えば光通信装置などであり、被監視部２は本来の目的である通信処理等を行い、障害検知部３、制御部４、出力部５にて、被監視部２での障害の発生および回復の監視を行う。
また、オペレーション装置６は、管理者が監視装置１の監視を行うために利用され、ディスプレイへの表示等により監視装置１の障害を監視することが出来る。オペレーション装置６へは、制御部４が出力した障害通知データを出力部５がオペレーション装置６へ通知出来る形式に変換して出力する。
【００１４】
次に動作について説明する。
被監視部２において障害が発生または回復すると、障害検知部３により検知される。障害検知部３により検知された障害の発生および回復は、制御部４に出力される。図２は、この発明の実施の形態１による、制御部４が行う障害通知の制御処理のフローチャートである。ここでは監視周期を１秒とし、障害回復をオペレーション装置６へ通知するために継続を確認する時間を３秒とする。
【００１５】
ステップＳＴ１およびステップＳＴ２は制御部４内部の初期化処理であり、それぞれ警報フラグをＯＦＦに設定し、障害の回復継続時間を０秒に設定する。警報フラグは、現時点での障害の発生または回復の状態を表しており、ＯＦＦであれば障害は発生していない状態、ＯＮであれば障害の発生が継続している状態である。制御部４による監視はステップＳＴ３からステップＳＴ１５までのステップを１秒ごとの定周期で繰り返し行う。
【００１６】
ステップＳＴ３で制御部４は、障害検知部３からの障害の検出信号の有無を確認する。ステップＳＴ４では検出信号の内容が障害の発生であるか回復であるかを判断し、発生であると判断された場合はステップＳＴ５へ、回復であると判断された場合はステップＳＴ９へ進む。
【００１７】
ステップＳＴ５では、制御部４は警報フラグを参照し、警報フラグがＯＮになっていれば、現在、障害の発生が継続している状態であると判断し、出力部５への通知は行わずにステップＳＴ１５へ進み、次回周期の監視まで待機する。ステップＳＴ５において、警報フラグがＯＦＦになっていれば、現在、障害は発生していない状態であると判断され、ステップＳＴ６で制御部４は警報フラグをＯＮにする。次にステップＳＴ７で制御部４は障害の発生を出力部５へ通知し、出力部５はオペレーション装置６へ通知できる形式に障害発生通知を変換し、出力する。次にステップＳＴ８で障害の回復継続時間を０秒に設定し、ステップＳＴ１５へ進んで、次回周期の監視まで待機する。
【００１８】
ステップＳＴ９では、制御部４は警報フラグを参照し、警報フラグがＯＦＦになっていれば、現在、障害は発生しておらず回復状態が継続していると判断し、出力部５への通知は行わずにステップＳＴ１５へ進み、次回周期の監視まで待機する。ステップＳＴ９において、警報フラグがＯＮになっていれば、現在、障害の発生が継続している状態であると判断し、ステップＳＴ１０へ進み制御部４は回復継続時間を１秒加算する。次にステップＳＴ１１で制御部４は回復継続時間の確認を行い、３秒未満である場合は、障害回復の通知を行う条件に達していないと判断してステップＳＴ１５へ進み、次回の監視まで待機する。回復継続時間が３秒以上であれば、ステップＳＴ１２で回復継続時間を０秒に戻し、ステップＳＴ１３で警報フラグをＯＦＦに設定し、ステップＳＴ１４で障害の回復を出力部５を経由してオペレーション装置６へ通知する。その後ステップＳＴ１５へ進み、次回周期の監視まで待機する。
【００１９】
ステップＳＴ１５では、制御部４は次の監視周期まで待機し、ステップＳＴ３に戻る。
【００２０】
図３は、この発明の実施の形態１による障害監視装置によるオペレーション装置への障害通知例を示す図である。図中、「検出された障害発生／回復」は、障害検知部３により検出された被監視部２での障害の発生および回復を示しており、図１３に示した従来の障害通知例と同じ条件になっている。「連続回復時間」は回復状況の継続時間を表す。また、「オペレーション装置への通知」は、オペレーション装置６に対し、障害の発生および回復が通知されるタイミングを示している。図より、障害の発生は１，３，７，９，１４，１８，２０，２７の各秒に、回復は２，６，８，１０，１７，１９，２４，２８の各秒に起こっている。実施の形態１では障害の発生は無条件に、回復は３秒以上連続して検知された場合にオペレーション装置６へ通知するので、発生については１，１４，２７の各秒に、回復については１２，２６，３０の各秒に通知される。従来例では通知できなかった２７秒での障害発生および２８秒での障害回復を、ここではそれぞれ２７秒と３０秒にオペレーション装置６へ通知することが出来る。
【００２１】
以上のように、この実施の形態１によれば、制御部４は、障害検知部３による被監視部２での障害の検出信号を一定周期で監視し、障害の発生を示す検出信号が出力された場合には、無条件に出力部５を介してオペレーション装置６へ通知し、検出信号が障害の回復を示している場合には、回復の通知が３秒継続したら通知するようにしたので、瞬時に発生した障害を取りこぼすことなく通知することが可能であり、監視品質の低下を招くことなく通知量を削減し、システム管理者にとって必要な障害の通知が行えるという効果が得られる。
【００２２】
実施の形態２．
実施の形態２による障害監視装置の構成は、図１に示されるものと同一であるが、制御部４の動作が異なる。
次に動作について説明する。
図４は、この発明の実施の形態２による、制御部４が行う障害通知の制御処理のフローチャートである。ステップＳＴ１、ステップＳＴ２は図２と同様、制御部４内部の初期化処理である。また実施の形態１と同様に、制御部４による監視はステップＳＴ３からステップＳＴ１５までのステップを１秒ごとの定周期で繰り返し行う。
【００２３】
ステップＳＴ３で障害検知部３からの障害の検出信号を確認すると、ステップＳＴ４では検出信号の内容が障害の発生であるか回復であるかを判断し、発生であると判断された場合はステップＳＴ５へ、回復であると判断された場合はステップＳＴ９へ進む。ステップＳＴ５へ進んだ場合、以降の処理は実施の形態１と同様である。
【００２４】
ステップＳＴ９では、制御部４は警報フラグを参照し、警報フラグがＯＦＦになっていれば、現在、障害は発生していない（回復状態が継続している）と判断し、出力部５への通知は行わずにステップＳＴ１５へ進み、次回周期の監視まで待機する。ステップＳＴ９において、警報フラグがＯＮになっていれば、現在、障害の発生が継続している状態であると判断し、ステップＳＴ４１へ進む。ステップＳＴ４１で制御部４は、回復継続時間を確認する。回復継続時間が０秒である場合は、現時点で障害の発生から回復に転じたものと判断し、ステップＳＴ４２で制御部４は、現在の時刻を障害検知部３で障害の回復が検出された時刻（障害回復時刻）として保存する。ステップＳＴ４１で回復継続時間が０秒でないことを確認した場合は、障害の回復状態が継続していると判断して、ステップＳＴ４２をスキップしてステップＳＴ１０に進む。
【００２５】
ステップＳＴ１０で制御部４は回復継続時間を１秒加算する。次にステップＳＴ１１で制御部４は回復継続時間の確認を行い、３秒未満である場合は、障害回復の通知を行う条件に達していないと判断してステップＳＴ１５へ進み、次回の監視まで待機する。回復継続時間が３秒以上であれば、ステップＳＴ１２で回復継続時間を０秒に戻し、ステップＳＴ１３で警報フラグをＯＦＦに設定する。ステップＳＴ４３で制御部４は、出力部５を経由してオペレーション装置６へ障害の回復を通知する。その際、制御部４に保存していた障害回復時刻をオペレーション装置６への通知情報に付加する。その後ステップＳＴ１５へ進み、次回周期の監視まで待機する。
【００２６】
図５は、この発明の実施の形態２による障害監視装置によるオペレーション装置への障害通知例を示す図である。被監視部２で発生した障害の発生および回復は図３と同条件になっている。実施の形態１と同様に実施の形態２においても、障害の回復については障害が回復し始めてから３秒経過後にオペレーション装置６に通知されるが、実施の形態２では障害回復時刻が通知情報に付加されており、障害が回復し始めた１０秒、２４秒、２８秒という遅延のない正確な時刻の情報が付加されるため、監視品質を向上させることができる。
【００２７】
以上のように、この実施の形態２によれば、制御部４は、障害検知部３で障害の回復が検出された時刻を障害回復時刻として記憶し、障害の回復をオペレーション装置６へ通知する際に、障害回復時刻を併せて通知するようにしたので、障害の発生および回復の検出時刻については遅延のない正確な時刻の通知が可能であり、監視品質がさらに向上するという効果が得られる。
【００２８】
実施の形態３．
実施の形態３による障害監視装置の構成は、図１に示されるものと同一であるが、制御部４の動作が異なる。実施の形態３では、制御部４は、障害の発生および回復が短時間に多発する障害ばたつき状態（障害多発状態）であるかどうかを監視し、障害ばたつき状態の時と通常状態の時で、オペレーション装置６への通知の規則を変える。
【００２９】
図６は、この発明の実施の形態３による、制御部４が行う障害ばたつき状態監視のフローチャートである。ここでは、制御部４は、現時点から過去５秒間に５回以上の障害の発生および回復が検知された場合に障害ばたつき状態であるとみなす。ステップＳＴ１０１からステップＳＴ１０３は制御部４で記憶している情報についての初期化処理である。ステップＳＴ１０１でばたつきフラグをＯＦＦにする。ばたつきフラグは障害ばたつき状態であるか通常状態であるかを表しており、ＯＮであれば障害ばたつき状態、ＯＦＦであれば通常状態である。次にステップＳＴ１０２で５秒間の障害発生および回復累計を０件にし、ステップＳＴ１０３にて０秒前（現在）から５秒前までのそれぞれの時点で障害検知部３により検知された障害の発生および回復数を０件に設定する。
【００３０】
障害ばたつき状態の監視は、０秒前（現在）から４秒前までの障害の発生および回復検知数を累計することにより行うので、ステップＳＴ１０４で制御部４は、５秒前の障害発生および回復検知数を０件にクリアする。ステップＳＴ１０５では、制御部４は障害検知部３で検知した障害の発生および回復件数を０秒前時点での障害発生および回復の検知数として格納する。次にステップＳＴ１０６で、０秒前から４秒前までの障害発生および回復検知数を累計し、ステップＳＴ１０７で制御部４は、ステップＳＴ１０６で得た累計が５件を超えているかを確認する。５件を超えている場合は、障害ばたつき状態と判断され、ステップＳＴ１０８で制御部４はばたつきフラグをＯＮに設定する。また、ステップＳＴ１０７で５件未満であればステップＳＴ１０９にスキップする。ステップＳＴ１０９で制御部４は、ステップＳＴ１０６で得た累計が０件であるかどうか判定し、０件である場合は、障害ばたつき状態は解消しているものと判断し、ステップＳＴ１１０でばたつきフラグをＯＦＦに設定する。ステップＳＴ１０９で、累計が０件でなかった場合は、ステップＳＴ１１１にスキップする。ステップＳＴ１１１では０秒前から４秒前の障害発生および回復検知数として記憶されている値を、次回の障害ばたつき状態判断のため、１秒前から５秒前までの障害発生および回復検知数にシフトして格納し、ステップＳＴ１０４に戻る。
【００３１】
図７は、この発明の実施の形態３による、制御部４が行う障害通知の制御処理のフローチャートである。実施の形態１と同様に、制御部４から出力部５への障害発生および回復の通知はステップＳＴ３からステップＳＴ１５までのステップを１秒ごとの定周期で繰り返し行うことにより処理される。ステップＳＴ４で制御部４が障害検知部３からの通知により障害の回復を検知し、ステップＳＴ９で、警報フラグがＯＮになっていることを確認すると、現在は障害が発生している状態と判断され、ステップＳＴ６１に進む。ステップＳＴ６１で制御部４は、ばたつきフラグの値を確認し、ばたつきフラグがＯＮであれば障害ばたつき状態であると判断してステップＳＴ１０へ進み、また、ばたつきフラグがＯＦＦであれば通常状態であると判断してステップＳＴ１３へ進む。通常状態の場合は障害回復の継続確認は行わないので、ステップＳＴ１３では、障害は回復したものとし、制御部４は警報フラグをＯＦＦに設定する。次に、ステップＳＴ１４で障害の回復を出力部５経由でオペレーション装置６へ通知する。ステップＳＴ６１にて、障害ばたつき状態であると判断された場合は、ステップＳＴ１１からステップＳＴ１４までの処理で障害回復の継続確認を行う。ステップＳＴ１１で回復継続時間が３秒以上であれば、障害は回復したものとし、ステップＳＴ１２で回復継続時間を０秒に設定し、ステップＳＴ１３で警報フラグをＯＦＦに設定し、ステップＳＴ１４で出力部５への通知を行う。ステップＳＴ１１で回復継続時間が３秒未満であるときは、ステップＳＴ１５へ進み、次回の監視まで待機する。
ステップＳＴ４で、障害の発生が通知されたときは、実施の形態１と同様に、ステップＳＴ５からステップＳＴ１５までの処理を行い、オペレーション装置６に、障害の発生を通知する。
【００３２】
図８は、この発明の実施の形態３による障害監視装置によるオペレーション装置への障害通知例を示す図である。図より、障害検知部３に検知された障害の発生は１，３，７，９，１４，１８，２０の各秒に、回復は２，６，８，１０，１７，１９，２４の各秒に起こっている。図中、「障害発生／回復累計」は、過去５秒間の障害発生および回復検知数を示している。また、「警報ばたつきフラグ」は、警報ばたつきフラグのＯＮ、ＯＦＦを示しており、「障害発生／回復累計」が５件以上になった１０秒から０件になった２９秒までは、警報ばたつきフラグはＯＮの状態になっている。警報ばたつきフラグがＯＦＦの間は、制御部４は障害検知部３で検知された障害発生および回復をそのまま出力部５経由でオペレーション装置６へ通知する。警報ばたつきフラグがＯＮの間は障害回復については３秒以上継続が確認された時点で、制御部４により出力部５経由でオペレーション装置６へ通知される。このため、障害ばたつき状態の時は、オペレーション装置６への通知は、発生が１４秒、回復が１２秒、２６秒となる。このように障害ばたつき状態が発生するまでは、正確にオペレーション装置６へ障害の発生および回復の通知がされ、障害ばたつき状態が確認された後はオペレーション装置６への不要な通知を削減することが出来る。
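The Embodiment 3 notification rule (FIG. 6 flapping monitor plus FIG. 7 notification control) can be sketched in Python as one per-second loop. This is a minimal simulation under stated assumptions, not the patented implementation. The function name `embodiment3_notices` is hypothetical. The detector is assumed to report the current state once per 1-second period. The flapping monitor is assumed to run before the notification control in each period, which is what the FIG. 8 recovery report at 12 seconds (rather than 10 seconds) implies. The flapping threshold is taken as 5 or more events in the last five seconds. The recovery counter is assumed to restart whenever a fault is seen.

```python
from collections import deque


def embodiment3_notices(occurrences, recoveries, end, hold=3, window=5, threshold=5):
    """Sketch of Embodiment 3 (FIG. 6 + FIG. 7) on a per-second timeline."""
    changes = {t: True for t in occurrences}
    changes.update({t: False for t in recoveries})
    history = deque([0] * window, maxlen=window)  # event counts, 0..4 seconds ago
    fault = flapping = alarm = False
    recovered_for = 0                              # recovery continuation time
    notices = []
    for t in range(1, end + 1):
        # Flapping monitor (FIG. 6), assumed to run first in each period.
        history.appendleft(occurrences.count(t) + recoveries.count(t))
        total = sum(history)
        if total >= threshold:
            flapping = True
        elif total == 0:
            flapping = False
        # Notification control (FIG. 7).
        fault = changes.get(t, fault)   # current state reported by the detector
        if fault:
            if not alarm:               # occurrences are always reported at once
                alarm = True
                notices.append((t, "occurrence"))
            recovered_for = 0           # assumption: a fault restarts the count
        elif alarm:
            if flapping:                # ST61 ON: recovery must persist `hold` s
                recovered_for += 1
            if not flapping or recovered_for >= hold:
                recovered_for = 0       # ST12
                alarm = False           # ST13
                notices.append((t, "recovery"))  # ST14
    return notices


print(embodiment3_notices([1, 3, 7, 9, 14, 18, 20],
                          [2, 6, 8, 10, 17, 19, 24], end=30))
# [(1, 'occurrence'), (2, 'recovery'), (3, 'occurrence'), (6, 'recovery'),
#  (7, 'occurrence'), (8, 'recovery'), (9, 'occurrence'), (12, 'recovery'),
#  (14, 'occurrence'), (26, 'recovery')]
```

Before flapping is detected every change is reported in the same second; while flapping, only the occurrence at 14 seconds and the recoveries at 12 and 26 seconds reach the operation device, as the FIG. 8 description states.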
【００３３】
以上のように、この実施の形態３によれば、制御部４が過去５秒間の障害の発生および回復検知数を累計することにより障害ばたつき状態が発生していないか監視し、障害ばたつき状態が回復するまでの間はオペレーション装置６への通知を削減するようにしたので、通常状態の時や障害ばたつき状態が確認される直前までの障害多発時には、検知された障害の発生および回復をとりこぼしなく通知し、障害ばたつき状態の間は通知量を削減してシステム管理者にとって必要な障害のみ通知出来るという効果が得られる。
【００３４】
実施の形態４．
実施の形態４による障害監視装置の構成は、図１に示されるものと同一であるが、制御部４の動作が異なる。実施の形態４でも、制御部４は障害ばたつき状態の時と通常状態の時でオペレーション装置６への通知の規則を変えるが、動作は実施の形態３と異なる。
【００３５】
図９は、この発明の実施の形態４による、制御部４が行う障害ばたつき状態監視のフローチャートである。
障害ばたつき状態の監視は、実施の形態３と同様に、０秒前（現在）から４秒前までの障害の発生および回復検知数を累計することにより行う。実施の形態４では、ステップＳＴ１０７で制御部４により、ステップＳＴ１０６で得た０秒前から４秒前までの障害発生および回復検知数の累計が５件以上と判断されると、ステップＳＴ８０８に進む。ステップＳＴ８０８では、制御部４はばたつきフラグをＯＮにするとともに、出力部５経由でオペレーション装置６に対し、障害ばたつき状態発生通知を送信する。ステップＳＴ１０７で５件未満であればステップＳＴ１０９にスキップする。ステップＳＴ１０９で制御部４は、ステップＳＴ１０６で得た累計が０件であるかどうか判定し、０件である場合は、障害ばたつき状態は解消しているものと判断し、ステップＳＴ８１０でばたつきフラグをＯＦＦにするとともに、出力部５経由でオペレーション装置６に対し、障害ばたつき状態回復通知を送信する。ステップＳＴ１０９で、累計が０件でなかった場合は、ステップＳＴ１１１にスキップする。ステップＳＴ１１１では０秒前から４秒前の障害発生および回復検知数として記憶されている値を、次回の障害ばたつき状態判断のため、１秒前から５秒前までの障害発生および回復検知数にシフトして格納し、ステップＳＴ１０４に戻る。
【００３６】
図１０は、この発明の実施の形態４による障害通知の制御処理のフローチャートである。ステップＳＴ８１で制御部４は、ばたつきフラグを確認し、ばたつきフラグがＯＮである場合は、出力部５への通知は一切行わずステップＳＴ１５へスキップする。ステップＳＴ８１でばたつきフラグがＯＦＦである場合は、通常状態であると判断され、ステップＳＴ４へ進む。ステップＳＴ４で制御部４が障害検知部３からの通知により障害の回復を検知し、ステップＳＴ９で、警報フラグがＯＮになっていることを確認すると、現在は障害が発生している状態と判断され、ステップＳＴ１３で、障害は回復したものとし、制御部４は警報フラグをＯＦＦに設定する。次に、ステップＳＴ１４で障害の回復を出力部５経由でオペレーション装置６へ通知する。ステップＳＴ９で警報フラグがＯＦＦになっている場合は、ステップＳＴ１５へ進み、次回周期の監視まで待機する。
【００３７】
ステップＳＴ４で、障害の発生が通知されたときは、実施の形態１と同様に、ステップＳＴ５からステップＳＴ１５までの処理を行い、オペレーション装置６に、障害の発生を通知する。
【００３８】
図１１は、この発明の実施の形態４による障害監視装置によるオペレーション装置への障害通知例を示す図である。障害検知部３により検知された障害の発生および回復は図８と同一条件になっており、「障害発生／回復累計」が５件以上になった１０秒から０件になった２９秒までは、警報ばたつきフラグはＯＮの状態になっている。図中、「オペレーション装置へのばたつき通知」は、制御部４が、ばたつき状態と判断したときにオペレーション装置６へ送信される、障害ばたつき状態発生通知および制御部４がばたつき状態が回復したと判断した時に送信される、障害ばたつき状態回復通知の送信タイミングを示している。図より、警報ばたつきフラグがＯＦＦの間は、制御部４は障害検知部３で検知された障害発生および回復をそのまま出力部５を経由してオペレーション装置６へ通知する。１０秒目以降は、障害ばたつき状態が解消される２９秒まで、障害の発生および回復は一切オペレーション装置６へ通知されない。このように、障害ばたつき状態が発生するまでは、正確にオペレーション装置６へ障害の発生および回復の通知がされ、障害ばたつき状態が確認された後はオペレーション装置６への不要な通知を行わないようにすることが出来る。
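Embodiment 4 (FIG. 9 and FIG. 10) can be sketched in the same style: while the flapping flag is ON nothing is reported, and the operation device instead receives one notification when flapping starts and one when it ends. This is an illustrative sketch with assumptions: the function name `embodiment4_notices` is hypothetical, the detector is assumed to report its current state once per second, and the start/end notifications are assumed to be sent only on transitions of the flapping flag (the flowchart text does not say how repeated evaluations are suppressed).

```python
from collections import deque


def embodiment4_notices(occurrences, recoveries, end, window=5, threshold=5):
    """Sketch of Embodiment 4 (FIG. 9 + FIG. 10) on a per-second timeline."""
    changes = {t: True for t in occurrences}
    changes.update({t: False for t in recoveries})
    history = deque([0] * window, maxlen=window)  # event counts, 0..4 seconds ago
    fault = flapping = alarm = False
    notices = []
    for t in range(1, end + 1):
        fault = changes.get(t, fault)   # current state reported by the detector
        # Flapping monitor (FIG. 9): notify only on transitions (assumption).
        history.appendleft(occurrences.count(t) + recoveries.count(t))
        total = sum(history)
        if total >= threshold and not flapping:
            flapping = True
            notices.append((t, "flapping start"))  # ST808
        elif total == 0 and flapping:
            flapping = False
            notices.append((t, "flapping end"))    # ST810
        # Notification control (FIG. 10).
        if flapping:                    # ST81: send nothing while flapping
            continue
        if fault and not alarm:
            alarm = True
            notices.append((t, "occurrence"))
        elif not fault and alarm:       # normal state: report recovery at once
            alarm = False
            notices.append((t, "recovery"))
    return notices


print(embodiment4_notices([1, 3, 7, 9, 14, 18, 20],
                          [2, 6, 8, 10, 17, 19, 24], end=30))
```

Because this sketch polls the detector's state every cycle, a recovery that is still outstanding when flapping clears is reported in that same cycle (at 29 seconds here, right after the flapping-end notification); the source does not state how FIG. 11 handles that case.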
【００３９】
以上のように、この実施の形態４によれば、制御部４が過去５秒間の障害の発生および回復検知数を累計することにより障害ばたつき状態になっていないかどうか監視し、障害ばたつき状態が発生した場合には制御部４はオペレーション装置６に対し障害ばたつき状態発生通知を送信し、障害ばたつき状態が解消した場合には障害ばたつき状態回復通知を送信するようにしたので、システム管理者は現在の障害発生の状態が把握出来る。また、障害ばたつき状態の間は、オペレーション装置６への通知を全く行わないようにしたので、不要な障害通知を行わず、通常状態の時や障害ばたつき状態が確認されるまでの障害多発時には、障害の発生および回復を取りこぼしなく通知出来るという効果が得られる。
【００４０】
【発明の効果】
以上のように、この発明によれば、障害が瞬時に発生と回復を繰り返すような場合でも、障害を取りこぼすことなく抽出し、さらに、不要な障害発生および回復の通知を削減することが出来るので、システム監視品質を低下させずにシステム管理者にとって必要な情報のみ通知出来ると共に、障害の発生および回復の検出時刻は正確に通知出来るという効果がある。
【００４２】
この発明によれば、障害の発生と回復の繰り返しが頻繁に起こっている場合には、障害の発生および回復の検出が一定数以上になった時点で、オペレーション装置への通知量を削減出来るという効果がある。
【００４３】
この発明によれば、障害の発生と回復の繰り返しが頻繁に起こっている場合には、障害の発生および回復の検出が一定数以上になった時点でオペレーション装置へその旨を通知し、障害の通知を中断することが出来るという効果がある。
【図面の簡単な説明】
【図１】この発明の実施の形態１による障害監視装置の構成を示すブロック図である。
【図２】この発明の実施の形態１による障害通知の制御処理のフローチャートである。
【図３】この発明の実施の形態１による障害監視装置によるオペレーション装置への障害通知例を示す図である。
【図４】この発明の実施の形態２による障害通知の制御処理のフローチャートである。
【図５】この発明の実施の形態２による障害監視装置によるオペレーション装置への障害通知例を示す図である。
【図６】この発明の実施の形態３による障害ばたつき状態監視のフローチャートである。
【図７】この発明の実施の形態３による障害通知の制御処理のフローチャートである。
【図８】この発明の実施の形態３による障害監視装置によるオペレーション装置への障害通知例を示す図である。
【図９】この発明の実施の形態４による障害ばたつき状態監視のフローチャートである。
【図１０】この発明の実施の形態４による障害通知の制御処理のフローチャートである。
【図１１】この発明の実施の形態４による障害監視装置によるオペレーション装置への障害通知例を示す図である。
【図１２】従来の障害監視装置を示す構成図である。
【図１３】従来の障害監視装置によるオペレーション装置への障害通知例を示す図である。
【符号の説明】
１監視装置、２被監視部、３障害検知部、４制御部、５出力部（制御部）、６オペレーション装置。

[0001]
[Technical field of the invention]
The present invention relates to a failure monitoring device that detects and reports the occurrence of, and recovery from, failures in various types of monitored equipment.
[0002]
[Prior art]
In an optical communication system, for example, communication quality must be monitored continuously, and when a failure occurs, the operation device that the system administrator uses for monitoring must be notified. However, when communication quality has degraded only slightly, the system often falls into a state in which failures occur and recover repeatedly within short intervals. If the operation device were notified of every occurrence and recovery in that state, a large volume of notifications would be issued, increasing the load on the system and forcing the system administrator to pick out the notifications that actually matter. It is therefore desirable to extract only the necessary notifications from the many failure occurrence and recovery notifications and transmit only those to the operation device. Conventionally, as in the failure monitoring device disclosed in Japanese Patent Laid-Open No. 5-175921, this has been achieved by notifying the operation device only when the occurrence or recovery of a failure has been detected continuously for a certain period of time.
[0003]
FIG. 12 is a block diagram showing the failure monitoring device disclosed in Japanese Patent Laid-Open No. 5-175921. In the figure, 101 is a monitored unit, 102 is a monitoring unit, 103 is software, 104 is a reporting terminal serving as the operation device for monitoring, 105 is a continuous occurrence/recovery monitoring circuit, and 106 is a processing unit. The monitored unit 101 includes the processing unit 106, which performs the apparatus's intended processing such as communication, and the continuous occurrence/recovery monitoring circuit 105, which detects failures occurring in the processing unit 106. The monitoring unit 102 has the software 103, which reports to the reporting terminal 104 the failure occurrences and recoveries notified by the continuous occurrence/recovery monitoring circuit 105.
[0004]
Next, the operation will be described.
The continuous occurrence/recovery monitoring circuit 105 notifies the software 103 of a failure occurrence only when a failure in the processing unit 106 has persisted for a certain time or longer. The monitoring unit 102 converts the failure notified by the circuit 105 into a form that can be sent to the reporting terminal 104 and sends it.
[0005]
Similarly, for recovery, the continuous occurrence/recovery monitoring circuit 105 notifies the monitoring unit 102 of a recovery only when the recovered state in the processing unit 106 has persisted for a certain time or longer, and the monitoring unit 102 then notifies the reporting terminal 104 of the recovery.
[0006]
FIG. 13 is a diagram showing an example of failure notifications to the operation device by the conventional failure monitoring device. It shows the notifications to the reporting terminal 104 when the monitoring period is 1 second and an occurrence or recovery must persist for 3 seconds before it is reported to the software 103. In the figure, "detected failure occurrence / recovery" indicates the occurrences and recoveries of failures in the processing unit 106, and "continuous occurrence / recovery time" indicates how long each state has persisted. "Notification to operation device" indicates when the reporting terminal 104 is notified of an occurrence or recovery. Failures occur at 1, 3, 7, 9, 14, 18, 20 and 27 seconds and recover at 2, 6, 8, 10, 17, 19, 24 and 28 seconds. Because the reporting terminal 104 is notified only after an occurrence or recovery has been detected continuously for 3 seconds or more, occurrences are notified only at 5 and 16 seconds, and recoveries only at 12 and 26 seconds. For example, the failure that occurred at 27 seconds recovered after 1 second, which is shorter than the 3-second continuation-confirmation time, so the reporting terminal 104 is never notified of it.
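The conventional rule of FIG. 13 is a persistence (debounce) filter applied to both directions: a state change is reported only after it has lasted 3 monitoring periods. The following is a minimal Python sketch under stated assumptions: the helper names `state_timeline` and `conventional_notices` are hypothetical, and the detector is assumed to report the current state (fault present or not) once per 1-second period, derived from the occurrence and recovery times listed above.

```python
def state_timeline(occurrences, recoveries, end):
    """Return [(second, fault_present)] for seconds 1..end.

    Hypothetical helper: assumes the detector reports the current state once
    per 1-second monitoring period, derived from the occurrence/recovery times.
    """
    changes = {t: True for t in occurrences}
    changes.update({t: False for t in recoveries})
    state, timeline = False, []
    for t in range(1, end + 1):
        state = changes.get(t, state)  # keep the previous state if nothing changed
        timeline.append((t, state))
    return timeline


def conventional_notices(timeline, hold=3):
    """Report a state change only after it has persisted for `hold` periods."""
    notices = []
    reported = False              # state last reported to the reporting terminal
    run_state, run_len = None, 0  # current state and how many periods it has lasted
    for t, fault in timeline:
        if fault == run_state:
            run_len += 1
        else:
            run_state, run_len = fault, 1
        if fault != reported and run_len >= hold:
            reported = fault
            notices.append((t, "occurrence" if fault else "recovery"))
    return notices


timeline = state_timeline([1, 3, 7, 9, 14, 18, 20, 27],
                          [2, 6, 8, 10, 17, 19, 24, 28], end=30)
print(conventional_notices(timeline))
# [(5, 'occurrence'), (12, 'recovery'), (16, 'occurrence'), (26, 'recovery')]
```

The 1-second fault at 27 seconds never accumulates 3 periods, so it is silently dropped, which is exactly the weakness described next.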
[0007]
[Problems to be solved by the invention]
Because the conventional failure monitoring device is configured as described above, the reporting terminal 104 used by the system administrator is not notified unless a failure occurrence or recovery persists for a certain period of time. Consequently, when failures that occur or recover too quickly to reach that period recur intermittently, the reporting terminal 104 receives no failure notification at all, which degrades the quality of system monitoring.
[0008]
The present invention has been made to solve the above problem. Its object is to obtain a failure monitoring device that, even when failures repeatedly occur and recover within an instant, reports every failure the system administrator needs without missing any, and reduces the volume of notifications without degrading the quality of system monitoring.
[0009]
[Means for Solving the Problems]
A failure monitoring device according to the present invention comprises: a failure detection unit that detects failure occurrence and recovery in a monitored unit and outputs a detection signal; and a control unit that, when the detection signal indicates a failure occurrence, outputs notification information indicating the failure occurrence state to an operation device as soon as the detection signal is confirmed, and, when the detection signal indicates a failure recovery, outputs notification information indicating the failure recovery state to the operation device once the recovered state has persisted for a certain time. The control unit also stores the time at which the failure detection unit detected the recovery as a failure recovery time, and adds that failure recovery time to the notification information when it outputs the failure recovery notification to the operation device.
[0011]
Another failure monitoring device according to the present invention comprises: a failure detection unit that detects failure occurrence and recovery in a monitored unit and outputs a detection signal; and a control unit that outputs notification information indicating the failure occurrence or recovery state to an operation device as soon as the detection signal is confirmed, but which, when the number of failure occurrences and recoveries detected within a certain time exceeds a certain number, determines that failures are occurring frequently (a frequent-failure state) and switches to outputting notification information indicating failure recovery to the operation device only after the recovered state has persisted for a certain time.
[0012]
A further failure monitoring device according to the present invention comprises: a failure detection unit that detects failure occurrence and recovery in a monitored unit and outputs a detection signal; and a control unit that outputs notification information indicating the failure occurrence or recovery state to an operation device as soon as the detection signal is confirmed, and which, when the number of failure occurrences and recoveries detected within a certain time exceeds a certain number, determines a frequent-failure state, notifies the operation device that the frequent-failure state exists, and stops outputting notification information indicating failure occurrence and recovery to the operation device.
[0013]
[Embodiments of the invention]
An embodiment of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a failure monitoring device according to Embodiment 1 of the present invention. In the figure, 1 is a monitoring device, 2 is a monitored unit, 3 is a failure detection unit, 4 is a control unit, 5 is an output unit (control unit), and 6 is an operation device. The monitoring device 1 is, for example, an optical communication device. The monitored unit 2 performs the device's intended processing, such as communication, and the failure detection unit 3, the control unit 4 and the output unit 5 monitor the monitored unit 2 for failure occurrence and recovery.
The operation device 6 is used by an administrator to monitor the monitoring device 1, and shows failures of the monitoring device 1, for example on a display. The output unit 5 converts the failure notification data output by the control unit 4 into a format that can be sent to the operation device 6, and outputs it to the operation device 6.
[0014]
Next, the operation will be described.
When a failure occurs or recovers in the monitored unit 2, the failure detection unit 3 detects it and outputs the occurrence or recovery to the control unit 4. FIG. 2 is a flowchart of the failure notification control process performed by the control unit 4 according to Embodiment 1 of the present invention. Here, the monitoring period is 1 second, and a recovery must persist for 3 seconds before it is reported to the operation device 6.
[0015]
Steps ST1 and ST2 initialize the internal state of the control unit 4: the alarm flag is set to OFF and the failure recovery continuation time is set to 0 seconds. The alarm flag represents the current failure state: OFF means that no failure has occurred, and ON means that a failure is ongoing. The control unit 4 then monitors by repeating steps ST3 through ST15 at a fixed period of 1 second.
[0016]
In step ST3, the control unit 4 checks for a failure detection signal from the failure detection unit 3. In step ST4, it determines whether the detection signal indicates a failure occurrence or a recovery; for an occurrence the process proceeds to step ST5, and for a recovery to step ST9.
[0017]
In step ST5, the control unit 4 refers to the alarm flag. If the flag is ON, it determines that a failure is ongoing, sends no notification to the output unit 5, and proceeds to step ST15 to wait for the next monitoring cycle. If the flag is OFF, it determines that no failure currently exists, and in step ST6 sets the alarm flag to ON. In step ST7, the control unit 4 notifies the output unit 5 of the failure occurrence, and the output unit 5 converts the notification into a format that can be sent to the operation device 6 and outputs it. In step ST8, the failure recovery continuation time is reset to 0 seconds, and the process proceeds to step ST15 to wait for the next monitoring cycle.
[0018]
In step ST9, the control unit 4 refers to the alarm flag. If the flag is OFF, it determines that no failure exists and the recovered state is continuing, sends no notification to the output unit 5, and proceeds to step ST15 to wait for the next monitoring cycle. If the flag is ON, it determines that a failure has been ongoing, and in step ST10 adds 1 second to the recovery continuation time. In step ST11, the control unit 4 checks the recovery continuation time. If it is less than 3 seconds, the condition for reporting the recovery has not yet been met, so the process proceeds to step ST15 and waits for the next monitoring cycle. If it is 3 seconds or more, the recovery continuation time is reset to 0 seconds in step ST12, the alarm flag is set to OFF in step ST13, and in step ST14 the recovery is reported to the operation device 6 via the output unit 5. The process then proceeds to step ST15 and waits for the next monitoring cycle.
[0019]
In step ST15, the control unit 4 stands by until the next monitoring cycle, and returns to step ST3.
[0020]
FIG. 3 is a diagram showing an example of failure notifications to the operation device by the failure monitoring device according to Embodiment 1 of the present invention. In the figure, "detected failure occurrence / recovery" indicates the failure occurrences and recoveries in the monitored unit 2 detected by the failure detection unit 3, under the same conditions as the conventional example shown in FIG. 13. "Continuous recovery time" indicates how long the recovered state has persisted. "Notification to operation device" indicates when the operation device 6 is notified of an occurrence or recovery. Failures occur at 1, 3, 7, 9, 14, 18, 20 and 27 seconds and recover at 2, 6, 8, 10, 17, 19, 24 and 28 seconds. In Embodiment 1, a failure occurrence is reported to the operation device 6 unconditionally, whereas a recovery is reported only after it has been detected continuously for 3 seconds or more, so occurrences are reported at 1, 14 and 27 seconds and recoveries at 12, 26 and 30 seconds. The failure occurrence at 27 seconds and the recovery at 28 seconds, which the conventional example could not report, are reported to the operation device 6 at 27 seconds and 30 seconds, respectively.
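The FIG. 2 loop can be sketched in Python to reproduce the FIG. 3 timeline. This is a minimal illustration, not the patented implementation. The names `state_timeline` and `embodiment1_notices` are hypothetical, and the detector is assumed to report the current state once per 1-second period. The flowchart text resets the recovery continuation time only in step ST8; the sketch also resets it whenever a fault is seen while the alarm is ON, because the "continuous recovery time" row of FIG. 3 (and the reported times 12, 26, 30) requires consecutive recovery.

```python
def state_timeline(occurrences, recoveries, end):
    """Hypothetical helper: per-second (second, fault_present) for 1..end."""
    changes = {t: True for t in occurrences}
    changes.update({t: False for t in recoveries})
    state, timeline = False, []
    for t in range(1, end + 1):
        state = changes.get(t, state)  # keep the previous state if nothing changed
        timeline.append((t, state))
    return timeline


def embodiment1_notices(timeline, hold=3):
    notices = []
    alarm = False        # ST1: alarm flag OFF
    recovered_for = 0    # ST2: recovery continuation time (seconds)
    for t, fault in timeline:               # one iteration = one cycle (ST3-ST15)
        if fault:                           # ST4: detection signal is an occurrence
            if not alarm:                   # ST5: no failure was active
                alarm = True                # ST6
                notices.append((t, "occurrence"))  # ST7: report at once
            recovered_for = 0               # ST8 (assumed also while alarm is ON)
        elif alarm:                         # ST9: recovery while a failure is active
            recovered_for += 1              # ST10
            if recovered_for >= hold:       # ST11
                recovered_for = 0           # ST12
                alarm = False               # ST13
                notices.append((t, "recovery"))    # ST14
    return notices


timeline = state_timeline([1, 3, 7, 9, 14, 18, 20, 27],
                          [2, 6, 8, 10, 17, 19, 24, 28], end=30)
print(embodiment1_notices(timeline))
# [(1, 'occurrence'), (12, 'recovery'), (14, 'occurrence'),
#  (26, 'recovery'), (27, 'occurrence'), (30, 'recovery')]
```

The asymmetry is the whole design: occurrences bypass the hold time, so a 1-second fault such as the one at 27 seconds is never lost, while only the recovery side is debounced.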
[0021]
As described above, according to Embodiment 1, the control unit 4 checks the failure detection signal that the failure detection unit 3 outputs for the monitored unit 2 at a fixed period. When the detection signal indicates a failure occurrence, the control unit 4 notifies the operation device 6 via the output unit 5 unconditionally; when the detection signal indicates a recovery, it notifies the operation device 6 only after the recovered state has persisted for 3 seconds. As a result, failures that occur only momentarily are reported without being missed, the volume of notifications is reduced without degrading monitoring quality, and the system administrator receives the failure notifications that are needed.
[0022]
Embodiment 2.
The configuration of the failure monitoring apparatus according to the second embodiment is the same as that shown in FIG. 1, but the operation of the control unit 4 is different.
Next, the operation will be described.
FIG. 4 is a flowchart of the failure notification control process performed by the control unit 4 according to Embodiment 2 of the present invention. Steps ST1 and ST2 are the same initialization of the control unit 4 as in FIG. 2. As in Embodiment 1, the control unit 4 monitors by repeating steps ST3 through ST15 at a fixed period of 1 second.
[0023]
When the failure detection signal from the failure detection unit 3 is confirmed in step ST3, step ST4 determines whether it indicates a failure occurrence or a recovery; the process proceeds to step ST5 for an occurrence and to step ST9 for a recovery. From step ST5 onward, the processing is the same as in Embodiment 1.
[0024]
In step ST9, the control unit 4 refers to the alarm flag. If the flag is OFF, it determines that no failure exists (the recovered state is continuing), sends no notification to the output unit 5, and proceeds to step ST15 to wait for the next monitoring cycle. If the flag is ON, it determines that a failure has been ongoing and proceeds to step ST41, where it checks the recovery continuation time. If the recovery continuation time is 0 seconds, the failure has just turned to recovery, so in step ST42 the control unit 4 saves the current time as the time at which the failure detection unit 3 detected the recovery (the failure recovery time) and proceeds to step ST10. If the recovery continuation time is not 0 seconds, the recovered state is already continuing, so step ST42 is skipped and the process proceeds directly to step ST10.
[0025]
In step ST10, the control unit 4 adds 1 second to the recovery continuation time. In step ST11, it checks the recovery continuation time. If it is less than 3 seconds, the condition for reporting the recovery has not yet been met, so the process proceeds to step ST15 and waits for the next monitoring cycle. If it is 3 seconds or more, the recovery continuation time is reset to 0 seconds in step ST12 and the alarm flag is set to OFF in step ST13. In step ST43, the control unit 4 reports the recovery to the operation device 6 via the output unit 5, attaching the failure recovery time stored in the control unit 4 to the notification information. The process then proceeds to step ST15 and waits for the next monitoring cycle.
[0026]
FIG. 5 is a diagram showing an example of failure notifications to the operation device by the failure monitoring device according to Embodiment 2 of the present invention. The failure occurrences and recoveries in the monitored unit 2 are the same as in FIG. 3. As in Embodiment 1, a recovery is reported to the operation device 6 only after the recovered state has persisted for 3 seconds. In Embodiment 2, however, the failure recovery time is attached to the notification information, so the exact, undelayed times at which the failures began to recover (10, 24 and 28 seconds) are conveyed, which improves monitoring quality.
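Embodiment 2 adds only steps ST41 to ST43 to the Embodiment 1 loop: the first second of a recovery is remembered and travels with the delayed recovery notice. The following Python sketch is illustrative only; the names `state_timeline` and `embodiment2_notices` are hypothetical, the detector is assumed to report its current state once per second, and the recovery counter is assumed to restart whenever a fault is seen (as FIG. 3 implies). Each notice is a `(notice_second, kind, detected_second)` tuple.

```python
def state_timeline(occurrences, recoveries, end):
    """Hypothetical helper: per-second (second, fault_present) for 1..end."""
    changes = {t: True for t in occurrences}
    changes.update({t: False for t in recoveries})
    state, timeline = False, []
    for t in range(1, end + 1):
        state = changes.get(t, state)  # keep the previous state if nothing changed
        timeline.append((t, state))
    return timeline


def embodiment2_notices(timeline, hold=3):
    notices = []
    alarm, recovered_for, recovery_time = False, 0, None
    for t, fault in timeline:
        if fault:
            if not alarm:
                alarm = True
                notices.append((t, "occurrence", t))  # occurrences are not delayed
            recovered_for = 0
        elif alarm:
            if recovered_for == 0:        # ST41: failure has just turned to recovery
                recovery_time = t         # ST42: remember the detection time
            recovered_for += 1            # ST10
            if recovered_for >= hold:     # ST11
                recovered_for = 0         # ST12
                alarm = False             # ST13
                notices.append((t, "recovery", recovery_time))  # ST43
    return notices


timeline = state_timeline([1, 3, 7, 9, 14, 18, 20, 27],
                          [2, 6, 8, 10, 17, 19, 24, 28], end=30)
notices = embodiment2_notices(timeline)
print([(n[0], n[2]) for n in notices if n[1] == "recovery"])
# [(12, 10), (26, 24), (30, 28)]
```

The notices still arrive at 12, 26 and 30 seconds, but each carries the recovery time 10, 24 or 28, matching FIG. 5.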
[0027]
As described above, according to Embodiment 2, the control unit 4 stores the time at which the failure detection unit 3 detected a recovery as the failure recovery time and sends it together with the recovery notification to the operation device 6. The detection times of both failure occurrence and recovery can therefore be reported accurately and without delay, which further improves monitoring quality.
[0028]
Embodiment 3.
The configuration of the failure monitoring device according to Embodiment 3 is the same as that shown in FIG. 1, but the operation of the control unit 4 differs. In Embodiment 3, the control unit 4 monitors whether failures are in a flapping state (frequent-failure state), in which failure occurrences and recoveries happen many times within a short period, and applies different rules for notifying the operation device 6 in the flapping state and in the normal state.
[0029]
FIG. 6 is a flowchart of the failure flapping state monitoring performed by the control unit 4 according to the third embodiment of the present invention. Here, the control unit 4 regards the system as being in the failure flapping state when failure occurrences and recoveries are detected five or more times in total within the past 5 seconds, counting back from the present. Steps ST101 to ST103 initialize the information stored in the control unit 4. In step ST101, the flapping flag is set to OFF. The flapping flag indicates whether the system is in the flapping state (ON) or the normal state (OFF). Next, in step ST102, the 5-second total of failure occurrences and recoveries is set to 0, and in step ST103, the number of failure occurrences and recoveries detected by the failure detection unit 3 at each time point from 0 seconds ago (current) to 5 seconds ago is set to 0.
[0030]
Because the failure flapping state is monitored by accumulating the numbers of failure occurrences and recoveries detected from 0 seconds ago (current) to 4 seconds ago, the control unit 4 clears the count for 5 seconds ago to zero in step ST104. In step ST105, the control unit 4 stores the number of failure occurrences and recoveries currently detected by the failure detection unit 3 as the count for 0 seconds ago. Next, in step ST106, the counts from 0 seconds ago to 4 seconds ago are summed, and in step ST107 the control unit 4 checks whether the sum obtained in step ST106 is 5 or more. If it is, the control unit 4 determines that the failure flapping state exists and sets the flapping flag to ON in step ST108. If the sum is less than 5 in step ST107, the process skips to step ST109. In step ST109, the control unit 4 checks whether the sum obtained in step ST106 is zero; if it is, the control unit 4 determines that the failure flapping state has been resolved and sets the flapping flag to OFF in step ST110. If the sum is not zero in step ST109, the process skips to step ST111. In step ST111, in preparation for the next flapping state determination, the counts stored for 0 seconds ago through 4 seconds ago are shifted so that they become the counts for 1 second ago through 5 seconds ago, and the process returns to step ST104.
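Steps ST101 to ST111 amount to a sliding-window counter with hysteresis. The following Python sketch is one way to express it, assuming a fixed 1-second tick. It keeps five per-second counts in a `collections.deque` instead of the six stored slots, folding the clear (ST104), store (ST105), and shift (ST111) steps into a single `appendleft`. `FlappingMonitor` and `flag_transitions` are hypothetical names.

```python
from collections import deque
from typing import Dict, List, Tuple


class FlappingMonitor:
    """Sliding-window flapping detector sketch (hypothetical names)."""

    def __init__(self, window: int = 5, threshold: int = 5) -> None:
        # ST101-ST103: flag OFF and every per-second count cleared.
        self.counts = deque([0] * window, maxlen=window)  # index 0 = "0 seconds ago"
        self.threshold = threshold
        self.flapping = False

    def tick(self, detections: int) -> bool:
        # ST104, ST105, ST111: on a full deque, appendleft() discards the
        # oldest count from the right end and stores this second's count.
        self.counts.appendleft(detections)
        total = sum(self.counts)             # ST106
        if total >= self.threshold:          # ST107
            self.flapping = True             # ST108
        elif total == 0:                     # ST109
            self.flapping = False            # ST110
        return self.flapping


def flag_transitions(occ: List[int], rec: List[int], end: int) -> List[Tuple[int, bool]]:
    """Return (second, new flag value) each time the flapping flag changes."""
    counts: Dict[int, int] = {}
    for t in occ + rec:
        counts[t] = counts.get(t, 0) + 1
    monitor, prev, changes = FlappingMonitor(), False, []
    for t in range(1, end + 1):
        flag = monitor.tick(counts.get(t, 0))
        if flag != prev:
            changes.append((t, flag))
            prev = flag
    return changes
```

With the FIG. 8 detections (failures at 1, 3, 7, 9, 14, 18, and 20 seconds; recoveries at 2, 6, 8, 10, 17, 19, and 24 seconds), the flag turns ON at 10 seconds and OFF at 29 seconds. Setting the flag ON at a sum of 5 but OFF only at 0 provides hysteresis, so a window that dips to between 1 and 4 does not end the flapping state.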
[0031]
FIG. 7 is a flowchart of the failure notification control processing performed by the control unit 4 according to the third embodiment of the present invention. As in the first embodiment, the control unit 4 handles notification of failure occurrence and recovery to the output unit 5 by repeating steps ST3 to ST15 at a fixed interval of 1 second. When the control unit 4 detects failure recovery from a notification from the failure detection unit 3 in step ST4 and confirms in step ST9 that the alarm flag is ON, the process proceeds to step ST61. In step ST61, the control unit 4 checks the flapping flag. If the flapping flag is ON, the control unit 4 determines that the system is in the failure flapping state and proceeds to step ST10; if it is OFF, the control unit 4 determines that the system is in the normal state and proceeds to step ST13. In the normal state, recovery continuation is not confirmed: in step ST13 the failure is regarded as recovered and the control unit 4 sets the alarm flag to OFF, and in step ST14 the failure recovery is notified to the operation device 6 via the output unit 5. If step ST61 determines that the system is in the flapping state, recovery continuation is confirmed by steps ST11 to ST14. If the recovery continuation time is 3 seconds or more in step ST11, the failure is regarded as recovered: the recovery continuation time is reset to 0 seconds in step ST12, the alarm flag is set to OFF in step ST13, and the recovery is notified via the output unit 5 in step ST14. If the recovery continuation time is less than 3 seconds in step ST11, the process proceeds to step ST15 and waits for the next monitoring cycle.
When the occurrence of a failure is notified in step ST4, the processing from step ST5 to step ST15 is performed as in the first embodiment, and the operation device 6 is notified of the occurrence of the failure.
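A minimal Python sketch of the FIG. 7 notification rule follows. It assumes the flapping flag has already been updated for the current 1-second cycle before the notification step runs (the source describes the two as separate processes and does not state their order), and it takes the flag as an input so the block stands alone. `FlappingAwareNotifier` and `replay_embodiment3` are hypothetical names.

```python
from typing import Dict, Iterable, List, Tuple


class FlappingAwareNotifier:
    """Sketch of the Embodiment 3 notification control (hypothetical names)."""

    def __init__(self, confirm_s: int = 3) -> None:
        self.confirm_s = confirm_s
        self.alarm = False            # alarm flag
        self.recovery_elapsed = 0     # recovery continuation time (seconds)

    def step(self, t: int, failed: bool, flapping: bool) -> List[Tuple[int, str]]:
        if failed:
            self.recovery_elapsed = 0
            if not self.alarm:        # occurrence handled as in Embodiment 1
                self.alarm = True
                return [(t, "occurrence")]
            return []
        if not self.alarm:            # ST9: no outstanding failure
            return []
        if flapping:                  # ST61 ON -> ST10, ST11
            self.recovery_elapsed += 1
            if self.recovery_elapsed < self.confirm_s:
                return []             # ST15: wait for the next cycle
        # Recovery accepted: reset the counter (ST12), alarm OFF (ST13), notify (ST14).
        self.recovery_elapsed = 0
        self.alarm = False
        return [(t, "recovery")]


def replay_embodiment3(occ: List[int], rec: List[int],
                       flapping_seconds: Iterable[int], end: int) -> List[Tuple[int, str]]:
    """Run the notifier over per-second states with a given flapping schedule."""
    events: Dict[int, bool] = {**{t: True for t in occ}, **{t: False for t in rec}}
    flapping = set(flapping_seconds)
    notifier, failed, out = FlappingAwareNotifier(), False, []
    for t in range(1, end + 1):
        failed = events.get(t, failed)    # state persists between events
        out += notifier.step(t, failed, t in flapping)
    return out
```

Feeding it the FIG. 8 detections with the flag ON from 10 to 28 seconds reproduces the notifications described for that figure: every event before 10 seconds is forwarded as is, and during flapping only the occurrence at 14 seconds and the recoveries at 12 and 26 seconds are sent.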
[0032]
FIG. 8 is a diagram showing an example of failure notification to the operation device by the failure monitoring device according to Embodiment 3 of the present invention. In this example, the failure detection unit 3 detects failure occurrences at 1, 3, 7, 9, 14, 18, and 20 seconds and recoveries at 2, 6, 8, 10, 17, 19, and 24 seconds. In the figure, “accumulation of failure occurrence / recovery” is the number of failure occurrences and recoveries detected in the past 5 seconds, and “alarm flapping flag” shows whether the flapping flag is ON or OFF. The flag is ON from 10 seconds, when the accumulated count reaches 5, until 29 seconds, when it falls to 0. While the flag is OFF, the control unit 4 forwards every failure occurrence and recovery detected by the failure detection unit 3 directly to the operation device 6 via the output unit 5. While the flag is ON, the control unit 4 notifies the operation device 6 of a recovery via the output unit 5 only after confirming that it has continued for 3 seconds or more. Consequently, during the failure flapping state the operation device 6 is notified of an occurrence at 14 seconds and of recoveries at 12 and 26 seconds. Thus, until the failure flapping state arises, the operation device 6 is accurately notified of every failure occurrence and recovery, and once the flapping state is confirmed, unnecessary notifications to the operation device 6 are reduced.
[0033]
As described above, according to the third embodiment, the control unit 4 monitors for the failure flapping state by accumulating the numbers of failure occurrences and recoveries detected in the past 5 seconds, and reduces notifications to the operation device 6 from the time the flapping state is confirmed until it is resolved. Failure occurrences and recoveries are therefore not missed in the normal state or when failures occur frequently just before the flapping state is confirmed, while the notification volume is reduced during the flapping state so that only the failures the system administrator needs are reported.
[0034]
Embodiment 4
The configuration of the failure monitoring apparatus according to the fourth embodiment is the same as that shown in FIG. 1, but the operation of the control unit 4 is different. In the fourth embodiment, too, the control unit 4 applies different rules for notifying the operation device 6 in the failure flapping state and in the normal state, but its operation differs from that of the third embodiment.
[0035]
FIG. 9 is a flowchart of the failure flapping state monitoring performed by the control unit 4 according to the fourth embodiment of the present invention.
As in the third embodiment, the failure flapping state is monitored by accumulating the numbers of failure occurrences and recoveries detected from 0 seconds ago (current) to 4 seconds ago. In the fourth embodiment, when the control unit 4 determines in step ST107 that the sum of failure occurrences and recoveries from 0 to 4 seconds ago, obtained in step ST106, is 5 or more, the process proceeds to step ST808. In step ST808, the control unit 4 sets the flapping flag to ON and transmits a failure flapping state occurrence notification to the operation device 6 via the output unit 5. If the sum is less than 5 in step ST107, the process skips to step ST109. In step ST109, the control unit 4 checks whether the sum obtained in step ST106 is zero; if it is, the control unit 4 determines that the failure flapping state has been resolved, and in step ST810 it sets the flapping flag to OFF and transmits a failure flapping state recovery notification to the operation device 6 via the output unit 5. If the sum is not zero in step ST109, the process skips to step ST111. In step ST111, in preparation for the next flapping state determination, the counts stored for 0 seconds ago through 4 seconds ago are shifted so that they become the counts for 1 second ago through 5 seconds ago, and the process returns to step ST104.
[0036]
FIG. 10 is a flowchart of the failure notification control processing according to the fourth embodiment of the present invention. In step ST81, the control unit 4 checks the flapping flag. If the flapping flag is ON, the control unit 4 skips to step ST15 without sending any notification to the output unit 5. If the flapping flag is OFF in step ST81, the control unit 4 determines that the system is in the normal state and proceeds to step ST4. When the control unit 4 detects failure recovery from a notification from the failure detection unit 3 in step ST4 and confirms in step ST9 that the alarm flag is ON, it regards the failure as recovered and sets the alarm flag to OFF in step ST13. Next, in step ST14, the failure recovery is notified to the operation device 6 via the output unit 5. If the alarm flag is OFF in step ST9, the process proceeds to step ST15 and waits for the next monitoring cycle.
[0037]
When the occurrence of a failure is notified in step ST4, the processing from step ST5 to step ST15 is performed as in the first embodiment, and the operation device 6 is notified of the occurrence of the failure.
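The FIG. 10 control can be sketched in Python as follows, again assuming one call per 1-second cycle with the flapping flag already updated for that cycle. `SuppressingNotifier` and `replay_embodiment4` are hypothetical names. Skipping the whole cycle while flapping leaves the alarm flag unchanged, which is this sketch's reading of "skips to step ST15".

```python
from typing import Dict, Iterable, List, Tuple


class SuppressingNotifier:
    """Embodiment 4 notification control sketch (hypothetical names)."""

    def __init__(self) -> None:
        self.alarm = False  # alarm flag

    def step(self, t: int, failed: bool, flapping: bool) -> List[Tuple[int, str]]:
        if flapping:                        # ST81 ON -> ST15: skip the cycle
            return []
        if failed and not self.alarm:       # occurrence, as in Embodiment 1
            self.alarm = True
            return [(t, "occurrence")]
        if not failed and self.alarm:       # ST9 -> ST13 -> ST14
            self.alarm = False
            return [(t, "recovery")]
        return []


def replay_embodiment4(occ: List[int], rec: List[int],
                       flapping_seconds: Iterable[int], end: int) -> List[Tuple[int, str]]:
    """Run the notifier over per-second states with a given flapping schedule."""
    events: Dict[int, bool] = {**{t: True for t in occ}, **{t: False for t in rec}}
    flapping = set(flapping_seconds)
    notifier, failed, out = SuppressingNotifier(), False, []
    for t in range(1, end + 1):
        failed = events.get(t, failed)    # state persists between events
        out += notifier.step(t, failed, t in flapping)
    return out
```

With the FIG. 8 detections and the flag ON from 10 to 28 seconds, every event before 10 seconds is forwarded and nothing is sent from 10 to 28 seconds. Because the alarm flag is left ON, this sketch also sends a recovery notice at 29 seconds, when the flag clears and the unit is found recovered; the source does not say whether FIG. 11 shows such a notice, so treat it as a property of the sketch.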
[0038]
FIG. 11 is a diagram showing an example of failure notification to the operation device by the failure monitoring device according to Embodiment 4 of the present invention. The failure occurrences and recoveries detected by the failure detection unit 3 are the same as in FIG. 8, and the alarm flapping flag is ON from 10 seconds, when the “accumulation of failure occurrence / recovery” reaches 5, until 29 seconds, when it falls to 0. In the figure, “notification of flapping to the operation device” shows when the failure flapping state occurrence notification is transmitted to the operation device 6 (when the control unit 4 determines that the flapping state has arisen) and when the failure flapping state recovery notification is transmitted (when the control unit 4 determines that the flapping state has been resolved). As the figure shows, while the alarm flapping flag is OFF, the control unit 4 forwards every failure occurrence and recovery detected by the failure detection unit 3 directly to the operation device 6 via the output unit 5. From 10 seconds onward, no failure occurrence or recovery is notified to the operation device 6 until 29 seconds, when the failure flapping state is resolved. Thus, until the failure flapping state arises, the operation device 6 is accurately notified of every failure occurrence and recovery, and once the flapping state is confirmed, no unnecessary notifications are sent to the operation device 6.
[0039]
As described above, according to the fourth embodiment, the control unit 4 monitors for the failure flapping state by accumulating the numbers of failure occurrences and recoveries detected in the past 5 seconds. It transmits a failure flapping state occurrence notification to the operation device 6 when the flapping state is confirmed and a failure flapping state recovery notification when the state is resolved, so the operation device 6 can grasp the overall failure situation. Further, because no failure notifications at all are sent to the operation device 6 during the flapping state, unnecessary failure notifications are avoided, while failure occurrences and recoveries are still reported without omission in the normal state and when failures occur frequently before the flapping state is confirmed.
[0040]
【Effects of the Invention】
As described above, according to the present invention, even when a failure repeatedly occurs and recovers instantaneously, it can be extracted without being missed, and notifications of unnecessary failure occurrences and recoveries can be further reduced. Therefore, only the information the system administrator needs is notified, without degrading system monitoring quality. In addition, the times at which failure occurrence and recovery are detected can be notified accurately.
[0042]
According to the present invention, when failures occur and recover frequently and the number of occurrences and recoveries reaches a certain count or more, the amount of notification to the operation device can be reduced.
[0043]
According to the present invention, when failures occur and recover frequently and the number of occurrences and recoveries reaches a certain count or more, this fact is notified to the operation device and individual failure notifications can be suspended.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a failure monitoring apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart of failure notification control processing according to Embodiment 1 of the present invention.
FIG. 3 is a diagram showing an example of failure notification to the operation device by the failure monitoring device according to Embodiment 1 of the present invention.
FIG. 4 is a flowchart of failure notification control processing according to Embodiment 2 of the present invention.
FIG. 5 is a diagram showing an example of failure notification to the operation device by the failure monitoring device according to Embodiment 2 of the present invention.
FIG. 6 is a flowchart of failure flapping state monitoring according to Embodiment 3 of the present invention.
FIG. 7 is a flowchart of failure notification control processing according to Embodiment 3 of the present invention.
FIG. 8 is a diagram showing an example of failure notification to the operation device by the failure monitoring device according to Embodiment 3 of the present invention.
FIG. 9 is a flowchart of failure flapping state monitoring according to Embodiment 4 of the present invention.
FIG. 10 is a flowchart of failure notification control processing according to Embodiment 4 of the present invention.
FIG. 11 is a diagram showing an example of failure notification to the operation device by the failure monitoring device according to Embodiment 4 of the present invention.
FIG. 12 is a block diagram showing a conventional failure monitoring apparatus.
FIG. 13 is a diagram showing an example of failure notification to an operation device by a conventional failure monitoring device.
[Explanation of symbols]
1 monitoring device, 2 monitored unit, 3 fault detection unit, 4 control unit, 5 output unit (control unit), 6 operation device.

Claims

A failure monitoring device comprising: a failure detection unit that detects occurrence of and recovery from a failure in a monitored unit and outputs a detection signal; and
a control unit that, when the detection signal indicates failure occurrence, outputs notification information indicating the failure occurrence state to an operation device as soon as the detection signal is confirmed, and, when the detection signal indicates failure recovery, outputs notification information indicating the failure recovery state to the operation device once the recovery state has continued for a certain time,
the control unit further storing the time at which failure recovery is detected by the failure detection unit as a failure recovery time, and adding the failure recovery time to the notification information when outputting the notification information indicating the failure recovery state to the operation device.

A failure monitoring device comprising: a failure detection unit that detects occurrence of and recovery from a failure in a monitored unit and outputs a detection signal; and
a control unit that outputs notification information indicating the failure occurrence or recovery state to an operation device as soon as the detection signal is confirmed,
and that, when the number of failure occurrences and recoveries detected within a certain time exceeds a certain number, determines a frequent-failure state and switches to outputting notification information indicating the failure recovery state to the operation device only when, with the detection signal indicating failure recovery, the recovery state has continued for a certain time.

A failure monitoring device comprising: a failure detection unit that detects occurrence of and recovery from a failure in a monitored unit and outputs a detection signal; and
a control unit that outputs notification information indicating the failure occurrence or recovery state to an operation device as soon as the detection signal is confirmed,
and that, when the number of failure occurrences and recoveries detected within a certain time exceeds a certain number, determines a frequent-failure state, notifies the operation device that the frequent-failure state exists, and stops outputting notification information indicating the failure occurrence and recovery states.