JP2008293441A

JP2008293441A - Method and apparatus for predicting device fault

Info

Publication number: JP2008293441A
Application number: JP2007140876A
Authority: JP
Inventors: Yasumitsu Kubomura; 泰光久保村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-05-28
Filing date: 2007-05-28
Publication date: 2008-12-04

Abstract

PROBLEM TO BE SOLVED: Predicting device failures by calculating correlations makes the prediction apparatus complicated, and whether a given phenomenon is a precursor of a failure can be learned only from past failure analysis data, during failure analysis after the failure has occurred.

SOLUTION: When a failure occurs in a device 2 and the log at the time of the failure shows that a phenomenon triggering the failure occurred beforehand, that phenomenon (failure prior information) is registered in a failure prior information database 21 from an administrator terminal 1. Failure prior information that is already known is registered in the failure prior information database 21 in advance, and information that becomes known later is registered as needed. Storing failure prior phenomena in a database in this way raises the likelihood that a failure can be predicted in advance. A failure prediction notification unit 22 reports the failure prediction to an administrator or a user.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a device failure prediction method and a device failure prediction apparatus, and more particularly to a method and an apparatus for predicting failures that occur in a device, for use in fields where failures need to be monitored.

A performance monitoring apparatus that predicts the occurrence of device failures and monitors the performance of an information processing system is known (see, for example, Patent Document 1). FIG. 4 is a configuration diagram of an example of the performance monitoring system described in Patent Document 1. In the figure, the performance monitoring apparatus 10 consists of a storage server 101 and an analysis server 102. The performance monitoring apparatus 10 is connected, through a communication line such as a local area network (LAN), to an information processing system consisting of a Web server 11, an AP (application) server 12, and a DB (database) server 13, and monitors the state of each server over this communication line.

The storage server 101 accumulates internally, as monitoring data, transaction data indicating the throughput, process name, and the like of the transactions carried over the communication lines connecting the servers 11 to 13. Based on the monitoring data accumulated in the storage server 101, the analysis server 102 detects failures currently occurring in the information processing system or predicts failures that may occur in it in the future.

This performance monitoring apparatus comprises monitoring means (the storage server 101) that monitors the operating status of the plurality of information processing devices (servers 11 to 13) and the data communication status of each communication line connecting them, and failure detection/prediction means (the analysis server 102) that, based on the monitoring data, detects a failure currently occurring in the information processing system or predicts the possibility of a future failure in it.

Here, the failure detection/prediction means predicts that a failure may occur in an information processing device in the future, based on the calculated correlations among multiple types of monitoring data and on the transitions of that data obtained so far by the monitoring means. The failure detection/prediction means also uses correlations, calculated from multiple types of monitoring data, for at least one of normal operation and abnormal operation of the information processing system, to detect a failure currently occurring in the system or to predict the possibility of a future failure.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2005-327261 (JP 2005-327261 A)

However, the performance monitoring apparatus of Patent Document 1, which is related art to the device failure prediction apparatus, predicts that a failure may occur in an information processing device in the future either from the calculated correlations among multiple types of monitoring data together with the transitions of that data obtained so far by the monitoring means, or from correlations, calculated from multiple types of monitoring data, for at least one of normal operation and abnormal operation of the information processing system. As a result, the amount of computation becomes enormous and the apparatus becomes complicated.

To simplify the apparatus, it is enough to skip the correlation calculation, give up predicting every failure that might occur in the future, and simply detect that a phenomenon that triggers a failure (a prior phenomenon) has occurred. However, whether a given phenomenon is a prior phenomenon related to a failure can be learned only from past failure analysis data, that is, during failure analysis after the failure has occurred.

The present invention has been made in view of the above. Its object is to provide a device failure prediction method and a device failure prediction apparatus that, by registering in a database in advance the prior phenomena that trigger failures, can predict the failure that will occur next according to whether the current phenomenon in a device is one of the registered prior phenomena.

To achieve the above object, the first invention is characterized by comprising: a first step of detecting a change in a system composed of devices; a second step of determining whether the detected change is due to the same phenomenon as a failure prior phenomenon, that is, a phenomenon that occurs immediately before a failure as its trigger; and a third step of giving notification to that effect when the second step determines that the change is due to the same phenomenon as the failure prior phenomenon.

To achieve the above object, the second invention is characterized by comprising: detection means for detecting a change in a system composed of devices; determination means for determining whether the detected change is due to the same phenomenon as a failure prior phenomenon, that is, a phenomenon that occurs immediately before a failure as its trigger; and notification means for giving notification to that effect when the determination means determines that the change is due to the same phenomenon as the failure prior phenomenon.

According to the present invention, based on failure prior phenomenon information, that is, information on the phenomenon that occurs immediately before a device failure as its trigger, the occurrence of a device failure can be predicted before the failure occurs and brought to the administrator's attention.

Next, the best mode for carrying out the present invention is described with reference to the drawings. FIG. 1 is a block diagram of an embodiment of the device failure prediction apparatus according to the present invention. In the figure, an administrator terminal 1 is connected to a device 2. In addition to the parts (not shown) needed for its normal operation, the device 2 has a failure prior information database 21 and a failure prediction notification unit 22. The failure prior information database 21 and the failure prediction notification unit 22 together constitute the device failure prediction apparatus.

The administrator terminal 1 is used to register phenomena that trigger failures in the failure prior information database 21. The administrator operating the administrator terminal 1 may be the user of the device 2 or someone else. When a failure occurs in the device 2 and the log at the time of the failure shows that a phenomenon triggering the failure occurred before it, that phenomenon (failure prior information) is registered (saved) in the failure prior information database 21 from the administrator terminal 1. All failure prior information that is already known is registered in advance, and information that becomes known later is registered as it is found. In this embodiment, building a database of failure prior phenomena raises the likelihood that failures can be predicted in advance.

The failure prior information database 21 has a determination function as well as a storage function. When a phenomenon identical to any one of the registered pieces of failure prior information occurs in the device 2, the database predicts, from the failure prior information corresponding to that phenomenon, the failure expected to occur next and notifies the failure prediction notification unit 22. The failure prediction notification unit 22 reports the failure prediction to the administrator or user visually, for example with a light-emitting diode (LED), audibly with an alarm sound, or by displaying an image.

Examples of the device 2 include an ATCA switch conforming to AdvancedTCA (Advanced Telecom Computing Architecture), the standard for next-generation communication equipment defined by PICMG (PCI Industrial Computer Manufacturers Group). An example of a failure is Port Link Down, and an example of failure prior information is port link up/down. That is, in an ATCA switch, once a port link up/down event occurs, the port may subsequently go link down.
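As a minimal sketch of how the failure prior information database 21 could pair storage with its determination function, the Python below keeps a mapping from a registered prior phenomenon to the failure predicted to follow it. The class name, method names, and event strings are illustrative assumptions; the patent does not specify a data format.

```python
from typing import Dict, Optional


class FailurePriorInfoDatabase:
    """Hypothetical stand-in for the failure prior information database 21."""

    def __init__(self) -> None:
        # Maps a prior phenomenon to the failure predicted to follow it.
        self._entries: Dict[str, str] = {}

    def register(self, prior_phenomenon: str, predicted_failure: str) -> None:
        """Store one piece of failure prior information (sent from administrator terminal 1)."""
        self._entries[prior_phenomenon] = predicted_failure

    def predict(self, observed_phenomenon: str) -> Optional[str]:
        """Determination function: return the predicted failure, or None if not registered."""
        return self._entries.get(observed_phenomenon)


db = FailurePriorInfoDatabase()
# Example from the ATCA switch case: port link up/down precedes Port Link Down.
db.register("port link up/down", "port link down")

print(db.predict("port link up/down"))
print(db.predict("fan speed change"))
```

An exact-match lookup is used because the description speaks of "the same phenomenon"; any fuzzier matching would be a design choice outside the text.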

Systems consisting of the administrator terminal 1 and the device 2 include, for example, a system in which a single management device 4 centrally manages multiple devices A, B, and C (three here, as an example), as shown in FIG. 2(A), and a single-device system 5 in which one device and one management device are integrated, as shown in FIG. 2(B). In the system of FIG. 2(A), the management device 4 that centrally manages devices A, B, and C contains the device failure prediction apparatus consisting of the failure prior information database 21 and the failure prediction notification unit 22 described above. In the single-device system 5 of FIG. 2(B), the device itself contains it.

Next, the operation of the embodiment of FIG. 1 is described with reference to FIG. 3, a flowchart of an embodiment of the device failure prediction method of the present invention. First, failure prior information indicating a phenomenon that triggers a failure is registered in advance from the administrator terminal 1 in the failure prior information database 21 (step S1). Next, the management device detects a change in the system shown in FIG. 2(A) or (B) (step S2). The management device is omitted from FIG. 1. Such a change in the system arises when one or more of the devices constituting the system exhibit some phenomenon other than their normal operation.

Next, the failure prior information database 21 determines whether the detected change is the same phenomenon as any failure prior information registered in it (step S3). If it is, the database notifies the failure prediction notification unit 22 to that effect. The failure prediction notification unit 22 then informs the administrator that a phenomenon that triggers a failure has occurred, that is, of the predicted failure (step S4).

When multiple pieces of failure prior information are registered in the failure prior information database 21 and the notification must distinguish which of them matched, the distinction is made by the blink period of a light, by a pattern of long and short intermittent sounds, or by a message displayed on a screen. These methods are well known, so a detailed description is omitted. The administrator can thus recognize that a failure may occur and, if the predicted failure would affect the operation of the system, take action such as replacing the device concerned.
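The distinction between multiple registered entries can be pictured as a lookup from each piece of failure prior information to its own light, sound, and message pattern. The blink periods, beep patterns, message texts, and the second entry below are invented for illustration; the patent only states that such methods are well known.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NotificationPattern:
    led_blink_period_s: float  # light: blink period in seconds
    beep_pattern: str          # sound: '.' is a short beep, '-' is a long beep
    message: str               # image: text shown on the screen


# Hypothetical table with one distinct pattern per registered prior phenomenon.
PATTERNS: Dict[str, NotificationPattern] = {
    "port link up/down": NotificationPattern(0.5, ".-", "Predicted failure: port link down"),
    "example phenomenon B": NotificationPattern(1.0, "..-", "Predicted failure: example failure B"),
}


def pattern_for(prior_phenomenon: str) -> NotificationPattern:
    """Return the pattern that identifies which registered entry matched."""
    return PATTERNS[prior_phenomenon]


p = pattern_for("port link up/down")
print(p.led_blink_period_s, p.beep_pattern, p.message)
```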

On the other hand, if step S3 determines that the change detected in step S2 is a phenomenon not present in the failure prior information registered in the failure prior information database 21, the phenomenon (the system change) is judged not to be a trigger of a failure, and the failure prior information database 21 sends no notification to the failure prediction notification unit 22 (step S5). Consequently, the failure prediction notification unit 22 issues no notification.
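The flow of steps S1 to S5 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the function name, the dictionary used for registered entries, and the sample change strings are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def handle_system_change(
    registered: Dict[str, str],
    change: str,
    notify: Callable[[str], None],
) -> Optional[str]:
    """Run steps S3 to S5 for one change already detected in step S2."""
    predicted_failure = registered.get(change)  # S3: is this a registered prior phenomenon?
    if predicted_failure is None:
        return None                             # S5: not registered, so nothing is notified
    notify(predicted_failure)                   # S4: report the predicted failure
    return predicted_failure


# S1: failure prior information registered in advance (contents are illustrative).
registered = {"port link up/down": "port link down"}

sent: List[str] = []
for change in ["configuration reload", "port link up/down"]:  # S2: detected changes
    handle_system_change(registered, change, sent.append)

print(sent)
```

Passing the notifier in as a callable keeps the determination logic separate from the LED, alarm, or display that actually informs the administrator.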

Thus, according to this embodiment, failure prior information indicating phenomena that trigger failures is registered in the failure prior information database 21 in advance, and each time a phenomenon accompanied by a system change occurs, it is determined whether that phenomenon is registered in the database. When the phenomenon is determined to be registered, the failure that it triggers can be predicted and reported to the administrator before the failure occurs. Furthermore, because this embodiment does not need to calculate correlations in order to predict every failure that might occur in the future, the configuration of the device can be simplified.

The present invention is not limited to the above embodiment. For example, besides having the administrator terminal 1 register failure prior information in the failure prior information database 21, the trigger of a failure may be registered in the database automatically from the log when the failure occurs.
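One way to picture the automatic registration variant: when a failure record appears in the log, the phenomenon logged immediately before it is taken as the trigger and registered. Treating the immediately preceding log entry as the trigger is a simplifying assumption made here, and the log format is hypothetical; the patent does not describe how the trigger is identified in the log.

```python
from typing import Dict, List, Tuple

# Hypothetical log format: (kind, description), oldest entry first.
LogEntry = Tuple[str, str]


def auto_register(log: List[LogEntry], database: Dict[str, str]) -> None:
    """Register the phenomenon preceding each failure as failure prior information."""
    for previous, current in zip(log, log[1:]):
        if current[0] == "failure" and previous[0] == "phenomenon":
            # Keep an existing entry if the phenomenon is already registered.
            database.setdefault(previous[1], current[1])


log = [
    ("phenomenon", "port link up/down"),
    ("failure", "port link down"),
]
database: Dict[str, str] = {}
auto_register(log, database)
print(database)
```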

FIG. 1 is a block diagram of an embodiment of the device failure prediction apparatus of the present invention.
FIG. 2 is a block diagram of examples of systems to which the device failure prediction apparatus of the present invention is applied.
FIG. 3 is a flowchart of an embodiment of the device failure prediction method of the present invention.
FIG. 4 is a configuration diagram of an example of the performance monitoring system described in Patent Document 1.

Explanation of symbols

1 Administrator terminal
2 Device
21 Failure prior information database
22 Failure prediction notification unit

Claims

1. A device failure prediction method comprising:
a first step of detecting a change in a system composed of devices;
a second step of determining whether the detected change is due to the same phenomenon as a failure prior phenomenon that occurs immediately before a failure as a trigger of that failure; and
a third step of giving notification to that effect when the second step determines that the change is due to the same phenomenon as the failure prior phenomenon.

2. A device failure prediction method by which an administrator who manages a system composed of one or more devices predicts the occurrence of a failure of the devices, the method comprising:
a first step of registering in a database, as failure prior information, a phenomenon that occurs immediately before a failure of the device as a trigger of that failure;
a second step of detecting a change in the system;
a third step of determining whether the detected change is due to the same phenomenon as the failure prior information registered in the database; and
a fourth step of notifying the administrator to that effect only when the third step determines that the change is due to the same phenomenon as the failure prior information.

3. The device failure prediction method according to claim 2, further comprising a fifth step of, when a failure occurs in the device, registering in the database a phenomenon that occurred immediately before the failure as a trigger of that failure.

4. The device failure prediction method according to claim 2, further comprising a step of, when a failure occurs in the device, automatically registering in the database, as the failure prior information, the phenomenon that triggered the failure, taken from the log at the time the failure occurred.

5. The device failure prediction method according to claim 2, wherein a plurality of pieces of the failure prior information are registered in the database, and the fourth step gives notification of the determination result by light, sound, or image in a manner that distinguishes which of the plurality of pieces of failure prior information corresponds to the same phenomenon.

6. A device failure prediction apparatus comprising:
detection means for detecting a change in a system composed of devices;
determination means for determining whether the detected change is due to the same phenomenon as a failure prior phenomenon that occurs immediately before a failure as a trigger of that failure; and
notification means for giving notification to that effect when the determination means determines that the change is due to the same phenomenon as the failure prior phenomenon.

7. A device failure prediction apparatus by which an administrator who manages a system composed of one or more devices predicts the occurrence of a failure of the devices, the apparatus comprising:
a database in which a phenomenon that occurs immediately before a failure of the device, as a trigger of that failure, is registered as failure prior information;
registration means for registering the failure prior information in the database in advance;
detection means for detecting a change in the system;
determination means for determining whether the detected change is due to the same phenomenon as the failure prior information registered in the database; and
notification means for notifying the administrator to that effect only when the determination means determines that the change is due to the same phenomenon as the failure prior information.

8. The device failure prediction apparatus according to claim 7, further comprising means for, when a failure occurs in the device, registering in the database a phenomenon that occurred immediately before the failure as a trigger of that failure.

9. The device failure prediction apparatus according to claim 7, further comprising means for, when a failure occurs in the device, automatically registering in the database, as the failure prior information, the phenomenon that triggered the failure, taken from the log at the time the failure occurred.

10. The device failure prediction apparatus according to any one of claims 7 to 9, wherein a plurality of pieces of the failure prior information are registered in the database, and the notification means gives notification of the determination result by light, sound, or image in a manner that distinguishes which of the plurality of pieces of failure prior information corresponds to the same phenomenon.