JP2016057658A

JP2016057658A - Fault information management system and fault information management method

Info

Publication number: JP2016057658A
Application number: JP2014181010A
Authority: JP
Inventors: 恵美子宮崎; Emiko Miyazaki
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-09-05
Filing date: 2014-09-05
Publication date: 2016-04-21

Abstract

PROBLEM TO BE SOLVED: To provide a fault information management system that, when a fault occurs, checks the factor that caused the fault and the function call status, determines whether the fault is the same as a previous one, and withholds the fault information file only when the fault is attributable to the same cause.

SOLUTION: The present invention is a fault information management system for managing an information file that includes fault information automatically generated at the time of process abnormality in a computer system. The fault information management system includes dump request means for requesting that an information file that includes fault information be generated at the time of process abnormality, a system dump collection unit for generating the information file that includes fault information at the request of the dump request means, a storage unit for storing the information file that includes fault information generated by the system dump collection unit, and an external storage device for storing a previous information file that includes previous fault information.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a failure information management system and a failure information management method, and more particularly to a failure information management system and a failure information management method for cases in which the same failure occurs repeatedly.

In general, when a failure occurs in a node on a computer network, the node's memory image (context, stack, data) is output to a dump file as failure information, and the cause of the failure is identified by analyzing that dump file. A dump file is created in a dump area, which may be shared by one or more nodes.

Normally, a node is restarted after a failure occurs. Therefore, when multiple nodes share a dump area and a failure occurs while one node is starting up, the cycle of failure, dump collection, restart, and failure may repeat, and the same dump file may be collected again and again.

In addition, when jobs are being run, the system is often operated so that a job interrupted by a failure is re-executed at restart. In such a case, if the job itself is the cause of the failure, the cycle of failure, dump collection, restart, job re-execution, and failure may repeat, and the same dump file may be collected again and again.

Furthermore, distributed parallel programs such as those using MPI (Message Passing Interface) in most cases run the same processing on every node in synchronization. The same failure can therefore occur on multiple nodes, and if those nodes share a dump area, the same dump file may be collected from each of them.

In such cases, if the remaining capacity of the dump area has become small and a different failure then occurs on another node, the necessary information may not be collected because the dump area is full. There is also the problem that analyzing multiple dump files takes time.

As a related technique, Patent Document 1 discloses storing a generated information file in an external storage device when the number of information files stored there has not reached a preset upper limit, the same information file does not already exist there, and the information file is a complete file that can be analyzed. In all other cases the generated information file is deleted.

Patent Document 2 discloses a technique that determines whether failure information satisfying a predetermined similarity condition with respect to failure information received from a node is already stored. According to the result, it sends the node output permission information indicating whether failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred on the node, may be output. When it determines that no failure information satisfying the similarity condition is stored in the failure information storage unit, it registers the received failure information in that unit.

Patent Document 1: Japanese Patent Laid-Open No. 11-024966
Patent Document 2: JP 2014-096031 A

However, when the same process is re-executed repeatedly and fails repeatedly, Patent Document 1 does not check whether the location of each repeated failure is actually the same as that of the earlier failure. It merely presumes that if the same process fails repeatedly, the problem is the same. The previous failure and the next failure may in fact have different causes, yet even then the failure information file may not be written because of the limit on the number of files.

Patent Document 2 presupposes inter-node communication and suppresses the writing of similar failure analysis files only within a single distributed parallel program that uses inter-node communication. It does not address similar failure analysis files when a dump area is shared among multiple systems.

In view of these points, an object of the present invention is to provide a failure information management system that, when a failure occurs, determines whether it is the same as a previous failure and does not write a failure information file when the failure has the same cause.

To solve the above problems, the present invention provides a failure information management system for managing an information file including failure information that is automatically generated when a process abnormality occurs in a computer system. The system includes dump request means for requesting generation of an information file including failure information when a process abnormality occurs, a system dump collection unit that generates the information file including the failure information in response to a request from the dump request means, a storage unit that stores the information file including the failure information generated by the system dump collection unit, and an external storage device that stores a previous information file including previous failure information. At least one external storage device is provided in the computer system and stores the failure information automatically generated, when a process abnormality occurs, from each computer constituting the computer system. The system dump collection unit has dump file output determination means that compares the contents of the failure information with those of the previous failure information and determines whether they match.

The present invention also provides a failure information management method of a failure information management system for managing an information file including failure information that is automatically generated when a process abnormality occurs in a computer system. The method includes a step of requesting generation of an information file including failure information when a process abnormality occurs, a step of generating the information file including the failure information in response to a request from the dump request means, a step of storing the information file including the failure information generated by the system dump collection unit, and a step of storing a previous information file including previous failure information. At least one external storage device is provided in the computer system, and the method includes a step of storing in it the failure information automatically generated, when a process abnormality occurs, from each computer constituting the computer system. The method further includes a step in which the system dump collection unit compares the contents of the failure information with those of the previous failure information and determines whether they match.

According to the present invention, it is possible to provide a failure information management system that, when a failure occurs, examines the cause of the failure and the function call status to determine whether it is the same as a previous failure, and does not write a failure information file when the failure has the same cause.

FIG. 1 is a block diagram showing the configuration of the failure information management system in the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the contents of the failure information in the first embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of the failure information management system in the first embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
(Embodiment)
The configuration of the embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of a failure information management system according to the embodiment of the present invention.

In FIG. 1, the failure information management system 10 includes a node 1000, dump request means 100, a system dump collection unit 200, a main storage unit 300, and a dump area unit 2000 in the external storage device 20. The system dump collection unit 200 includes dump file output determination means 210, a dump file output unit 220, and a failure traceback information collection unit 230. When a failure occurs, the main storage unit 300 stores the failure information as failure traceback information 310 in response to a request from the failure traceback information collection unit 230. The dump area unit 2000 holds a plurality of dump files 400-1, ..., 400-n as information files, and each dump file contains dump failure traceback information 410-1, ..., 410-n, respectively. Even when there are multiple nodes, their dump files are stored together in the dump area unit 2000 of the external storage device 20.

FIG. 2 shows the details of the failure traceback information 310. The failure traceback information 310 consists of kernel information, the cause of the failure, the name of the command that failed, and the function call status leading up to the failure (traceback information). The traceback information represents the history of instructions up to the occurrence of the failure.

For example, the first three lines of FIG. 2 are kernel information: Knlinfo indicates where the kernel was built, Cause indicates what caused the failure, and comm is a comment field.

The fourth line onward shows the function call status (traceback information), listing the address of each function and the name of the called function. Lines nearer the top represent more recent calls. If the cause of the failure and the function call status (traceback information) are both the same, the failure can be presumed to be the same failure. The failure traceback information 310 is written to the dump file 400-1 as dump failure traceback information 410-1.
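The layout described for FIG. 2 and the sameness criterion can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: the class `FailureTraceback`, the function `parse_traceback`, the `Key: value` header syntax, and the sample addresses and function names are all assumptions made for the example. Only the field names Knlinfo, Cause, and comm and the newest-first ordering come from the description. Comparing the address together with the function name is also an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FailureTraceback:
    """Hypothetical in-memory form of the failure traceback information."""
    knlinfo: str                         # where the kernel was built
    cause: str                           # what caused the failure
    comm: str                            # comment field
    frames: Tuple[Tuple[str, ...], ...]  # (address, function), newest call first

    def same_failure(self, other: "FailureTraceback") -> bool:
        # Two failures are presumed identical when both the cause and the
        # call sequence leading up to the failure match.
        return self.cause == other.cause and self.frames == other.frames


def parse_traceback(text: str) -> FailureTraceback:
    """Parse an assumed layout: three 'Key: value' lines, then 'address function' lines."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = {}
    for ln in lines[:3]:
        key, value = ln.split(":", 1)
        header[key.strip().lower()] = value.strip()
    frames = tuple(tuple(ln.split(None, 1)) for ln in lines[3:])
    return FailureTraceback(header["knlinfo"], header["cause"], header["comm"], frames)


# Made-up sample data in the assumed layout.
SAMPLE = """
Knlinfo: build-host-01
Cause: NULL pointer dereference
comm: nightly batch
0xffffffff81001234 do_page_fault
0xffffffff81005678 sys_read
"""

a = parse_traceback(SAMPLE)
b = parse_traceback(SAMPLE.replace("build-host-01", "build-host-02"))
c = parse_traceback(SAMPLE.replace("sys_read", "sys_write"))
print(a.same_failure(b))  # True: same cause and same call sequence
print(a.same_failure(c))  # False: the call sequence differs
```

In this sketch the kernel build information and the comment field are deliberately left out of the comparison, because the description names only the cause and the call status as the criterion.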

The operation of the embodiment of the present invention will be described in detail with reference to FIGS. 1 and 3.

FIG. 3 is a flowchart showing the operation of the embodiment of the present invention.

First, when a failure occurs in the node 1000 (S301), the dump request means 100 notifies the dump file output determination means 210 of the system dump collection unit 200 that a failure has occurred (S302). The dump file output determination means 210 notifies the failure traceback information collection unit 230 and requests it to create the failure traceback information 310 (S303). The failure traceback information collection unit 230 creates the failure traceback information 310 from the failure information in the main storage unit 300 (S304).

The dump file output determination means 210 looks up the dump file 400-1 in the dump area unit 2000 and reads the dump failure traceback information 410-1 from it (S305). The dump file output determination means 210 then compares the failure traceback information 310 with the dump failure traceback information 410-1 and checks whether it has the same failure cause and the same traceback information as the failure traceback information 310 (S306).

If they do not match (No in S307), the dump file output determination means 210 looks up the next dump file 400-2 in the dump area unit 2000 (Yes in S308), reads the dump failure traceback information 410-2 in the same way, and compares it with the failure traceback information 310 (S305 to S307).

When all the dump files 400-1 to 400-n in the dump area unit 2000 have been searched and none of their dump failure traceback information 410-1 to 410-n matches the failure traceback information 310 (No in S307, No in S308), the dump file output unit 220 outputs to the dump area unit 2000 a new dump file 400-(n+1) that includes the dump failure traceback information 410-(n+1) (S309).

Conversely, if a dump file with the same traceback information is found (Yes in S307), the process ends without outputting a new dump file (S310).
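The decision flow of steps S305 to S310 can be sketched as a simple loop over the stored dump files. This is an illustrative sketch under stated assumptions: the names `Traceback`, `DumpFile`, and `handle_failure` are hypothetical, a Python list stands in for the dump area unit 2000 on the external storage device, and the sample causes and function names are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Traceback:
    cause: str                 # cause of the failure
    frames: Tuple[str, ...]    # called functions, newest first


@dataclass
class DumpFile:
    traceback: Traceback       # dump failure traceback information (410-x)
    memory_image: bytes        # the rest of the dump


def handle_failure(new_tb: Traceback, memory_image: bytes,
                   dump_area: List[DumpFile]) -> bool:
    """Return True if a new dump file was written, False if it was suppressed."""
    for existing in dump_area:                        # S305, S308: visit each stored dump
        same = (existing.traceback.cause == new_tb.cause
                and existing.traceback.frames == new_tb.frames)  # S306: compare
        if same:                                      # S307: Yes
            return False                              # S310: end without output
    dump_area.append(DumpFile(new_tb, memory_image))  # S309: write dump 400-(n+1)
    return True


# The list plays the role of a dump area shared by several nodes.
area: List[DumpFile] = []
tb1 = Traceback("NULL pointer dereference", ("do_page_fault", "sys_read"))
tb2 = Traceback("NULL pointer dereference", ("do_page_fault", "sys_write"))
print(handle_failure(tb1, b"image-1", area))  # True: first occurrence is stored
print(handle_failure(tb1, b"image-2", area))  # False: duplicate is suppressed
print(handle_failure(tb2, b"image-3", area))  # True: different call sequence
print(len(area))                              # 2
```

The linear scan mirrors the flowchart. An index keyed on the cause and call sequence would avoid rescanning every stored dump, but that is a design choice outside what the description states.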

As described above, according to the embodiment of the present invention, it is possible to provide a failure information management system that, when a failure occurs, examines the cause of the failure and the function call status to determine whether it is the same as a previous failure, and withholds the failure information file only when the failure has the same cause.

According to the embodiment of the present invention, duplicate dump files need not be stored, so the disk capacity consumed in the dump area can be reduced. In addition, because duplicate dump files are not stored, they never have to be analyzed, which makes the analysis work more efficient.

The present invention is not limited to the above-described embodiment and can be implemented with various changes and modifications without departing from the gist of the present invention.

The present invention is applicable to a failure information management system in which a dump file is generated when a failure occurs.

DESCRIPTION OF SYMBOLS
10 Failure information management system
20 External storage device
100 Dump request means
200 System dump collection unit
210 Dump file output determination means
220 Dump file output unit
230 Failure traceback information collection unit
300 Main storage unit
310 Failure traceback information
400-1 to 400-(n+1) Dump file
410-1 to 410-(n+1) Dump failure traceback information
1000 Node (computer)
1001 Node (computer)
2000 Dump area unit

Claims

A failure information management system for managing an information file including failure information automatically generated when a process abnormality occurs in a computer system, the system comprising:
dump request means for requesting generation of the information file including the failure information when the process abnormality occurs;
a system dump collection unit that generates the information file including the failure information in response to a request from the dump request means;
a storage unit that stores the information file including the failure information generated by the system dump collection unit; and
an external storage device that stores a previous information file including previous failure information,
wherein at least one external storage device is provided in the computer system and stores the failure information automatically generated, when a process abnormality occurs, from each computer constituting the computer system, and
wherein the system dump collection unit includes dump file output determination means for comparing the contents of the failure information and the previous failure information to determine whether they match.

2. The failure information management system according to claim 1, wherein the system dump collection unit stores the information file including the failure information in the external storage device when the dump file output determination means determines that they do not match.

The failure information management system according to claim 1, wherein the failure information includes traceback information representing a history of instructions up to the occurrence of a failure.

4. The failure information management system according to claim 1, wherein the external storage device stores, without duplication, the failure information that is automatically generated when a process abnormality occurs in the computer system.

A failure information management method of a failure information management system for managing an information file including failure information automatically generated when a process abnormality occurs in a computer system, the method comprising:
requesting generation of the information file including the failure information when the process abnormality occurs;
generating the information file including the failure information in response to a request from dump request means;
storing the information file including the failure information generated by a system dump collection unit;
storing a previous information file including previous failure information;
storing, in at least one external storage device provided in the computer system, the failure information automatically generated, when a process abnormality occurs, from each computer constituting the computer system; and
comparing, by the system dump collection unit, the contents of the failure information and the previous failure information to determine whether they match.

7. The failure information management method according to claim 6, further comprising a step in which the system dump collection unit stores the information file including the failure information when the dump file output determination means determines that they do not match.