JP2015028700A

JP2015028700A - Failure detection device, failure detection method, failure detection program and recording medium

Info

Publication number: JP2015028700A
Application number: JP2013157773A
Authority: JP
Inventors: 明彦西谷; Akihiko Nishitani; 茂莉黒川; Mori Kurokawa
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2013-07-30
Filing date: 2013-07-30
Publication date: 2015-02-12

Abstract

PROBLEM TO BE SOLVED: To obtain a failure detection device capable of detecting failures with high accuracy by observing and analyzing a plurality of kinds of states in parallel, in a monitoring system including a monitoring server for monitoring silent failures. SOLUTION: The failure detection device for monitoring the occurrence of a silent failure of a monitoring object host includes: an abnormality determination part that compares a system log of the monitoring object host with a normal state model, which is the log transitions of system logs that existed in the past in the monitoring object host, to determine an abnormality; and a failure estimation part that, when the abnormality determination part determines an abnormality, estimates a silent failure by taking into consideration a negative event indicated by at least one kind of observation data among SNS information, carrier call center information, user behavior information, and the number of accesses to a service, and by comparing the word appearance distribution of the immediately preceding system log with a result estimation model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs.

Description

The present invention relates to a failure detection system including a monitoring server that monitors a plurality of monitored hosts for silent failures (failures that produce no visible symptoms when they occur), and in particular to a failure detection device, a failure detection method, a failure detection program, and a recording medium that efficiently detect the occurrence of silent failures.

Conventionally, silent failures in server systems such as web servers and mail servers have been handled by monitoring the syslog, a text log output by the server serving as the monitored host, and by checking resource information such as CPU load, memory usage, number of I/O waits, and number of packets to determine whether a silent failure has occurred.

For example, anomaly detection by threshold monitoring periodically monitors designated items such as CPU usage rate and memory usage, and detects an anomaly according to whether an observed value crosses a set threshold. General monitoring tools that use alive monitoring and resource monitoring (Zabbix, Nagios, etc.) fall into this category.
Anomaly detection by correlation analysis discovers, models, and monitors time-series relationships that hold consistently among resource information, performance information, and the like, and detects an anomaly by examining behavior that differs from the normal-state model.
Anomaly detection by state transition pattern analysis models and monitors time-series transition patterns of log records and resource information, and detects an anomaly by examining transition patterns that are absent from the normal-state model.
Techniques for detecting abnormal operation of a system or process as described above are disclosed in, for example, Patent Literature 1 to 3 and Non-Patent Literature 1 and 2.

Japanese Patent Application No. 2012-275113
JP 2012-094046 A
JP 2010-250502 A

The Institute of Electronics, Information and Communication Engineers, Vol. 100, pp. 50-60
NEC Technical Journal, Vol. 63, No. 2/2010, "Data center operation in the cloud era realized by WebSAM Ver.8"

However, because the methods described above determine the presence or absence of a failure based only on resource information, they cannot judge failure occurrence accurately: they frequently detect anomalies unrelated to any failure, and it is hard to tell which failure a detected anomaly stems from.
In addition, because resource information must be monitored periodically for every monitored host, the volume of monitoring traffic grows and the monitoring load increases.

In conventional detection techniques such as correlation analysis and transition pattern analysis, the monitoring system holds a model of the normal state internally, compares the periodically observed current state against the normal state model to determine whether it is abnormal, and issues an alarm if it is. To prevent alarms from being issued continuously while a state judged abnormal persists, such systems provide a mechanism (automatic learning function) by which an anomaly, once detected, is learned into the model as a normal state and is no longer recognized as an anomaly when the same state occurs again.
However, when this automatic learning function is used, an anomaly of a given type is detected only the first time even if it occurs repeatedly. A situation in which anomalies are small in scale (few in number) but keep occurring therefore cannot be detected.

The present invention has been proposed in view of the above circumstances. Its object is to provide a failure detection device, a failure detection method, a failure detection program, and a recording medium that, in a monitoring system including a monitoring server for monitoring silent failures, detect failures accurately by observing and analyzing a plurality of kinds of states in parallel while reducing the load of monitoring the monitored hosts.

To achieve the above object, the present invention provides a failure detection device that monitors a monitored host for the occurrence of silent failures, characterized by including the following components.
An abnormality determination unit that determines an abnormality by comparing a system log of the monitored host with a normal state model, which is the log transitions of system logs that existed in the past on the monitored host.
A failure estimation unit that, when the abnormality determination unit determines an abnormality, estimates a silent failure by taking into account a negative event indicated by at least one kind of observation data among SNS information, carrier call center information, user behavior information, and the number of accesses to a service, and by comparing the word appearance distribution of the immediately preceding system log with a result prediction model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs.

The invention of claim 2 is a failure detection device that monitors a monitored host for the occurrence of silent failures, characterized by including the following components.
An initial setting unit that reads the log transitions of system logs that existed in the past on the monitored host as a normal state model, and reads the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs as a result prediction model.
An information collection unit that reads the system log obtained by log transition monitoring on the monitored host, together with at least one kind of observation data among SNS information, carrier call center information, user behavior information, and the number of accesses to a service.
A data information processing unit that processes the system log into a format that can be compared with each of the models.
An abnormality determination unit that determines an abnormality by comparing the system log with the normal state model.
A failure estimation unit that, when the abnormality determination unit determines an abnormality, estimates a silent failure by taking into account a negative event indicated by the observation data and by comparing the word appearance distribution of the immediately preceding system log with the result prediction model.
A model generation unit that generates and holds the normal state model through learning, and generates and holds the result prediction model for which the failure estimation unit estimated a silent failure.
A timer management unit that manages a non-learning period during which the model generation unit does not learn for a fixed time while the abnormality determination unit observes new transitions in the system log.

Claim 3 is a method for monitoring a monitored host for the occurrence of silent failures, characterized by comprising:
a procedure of generating and holding a normal state model for each of the system log of the monitored host and at least one kind of observation data among SNS information, carrier call center information, user behavior information, and the number of accesses to a service, and observing the presence or absence of anomalies by combining a plurality of kinds of observation items; and
a procedure of holding, as a result prediction model, the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs of the monitored host, and determining the presence or absence of a silent failure by comparing the result prediction model with the word appearance distribution of the immediately preceding system log when anomalies are observed simultaneously in a plurality of observation items.

Claim 4 is characterized in that, in the failure detection method of claim 3, the result prediction model is formed from word appearance distribution (appearance rate) information associated with the number of detected new transitions, or from word appearance distribution (appearance rate) information associated with the category components of new transitions.

Claim 5 is characterized in that, in the failure detection method of claim 3, the result prediction model is compared with the state of the immediately preceding system log by an analysis method such as Mahalanobis distance, Z-test, or T-test.

Claim 6 is characterized in that, in the failure detection method of claim 3, from the point at which the number of new transition pairs detected by the log transition monitoring exceeds a threshold specified by the operator, learning is suspended for a period specified by the operator, and it is observed whether new transitions continue to appear every hour in numbers exceeding the threshold. This removes noise, and a state that inconspicuously continues to exceed the threshold is watched and treated as an anomaly, so that a possible silent failure is suspected.

Claim 7 is characterized in that, in the failure detection method of claim 6, all new transitions exceeding the threshold are continuously observed.

Claim 8 is characterized in that, in the failure detection method of claim 6, among the new transitions exceeding the threshold, only new transitions containing a specific keyword are continuously observed.

Claim 9 is characterized in that, in the failure detection method of claim 3, the operator can designate the plurality of observation targets at the start of system operation or during operation.

Claim 10 is characterized in that, in the failure detection method of claim 3, a state determined to be a failure is stored as a failure state model, and when a similar state is subsequently detected, information extracted from past failure state models can be presented as reference values.

Claim 11 is characterized in that, in the failure detection method of claim 3, a state determined to be a failure can be learned into the failure model on the operator's instruction at the start of system operation or during operation.

Claim 12 is characterized by a failure detection program that enables a computer to execute each procedure of the failure detection method for a monitored host according to claim 3.

Claim 13 is characterized by a computer-readable recording medium storing the failure detection program according to claim 12.

According to the present invention, the system log from the server system (monitored host) is observed in combination with multiple kinds of observation data: SNS (tweets and the like), customer inquiry status (carrier call center information and the like), behavior information (Web access information, GPS information, and the like), and the number of accesses to a service. A failure is suspected when anomalies are found simultaneously in multiple kinds of observation items, and a silent failure is estimated by comparing the word appearance distribution of the immediately preceding system log with a result prediction model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs. Because anomalies in other observation results are considered alongside log analysis, the accuracy of determining whether a failure has occurred in the server system can be improved.

In addition, from the time at which the number of anomalies detected in the observation target exceeds a threshold specified by the operator, the model is not trained for a period specified by the operator (for example, several hours or two days), and whether the anomalies persist over that period is observed. This prevents the situation in which dynamic learning makes it impossible to determine whether an anomaly is present.

FIG. 1 is an overall system configuration diagram for the case where the failure detection device of the present invention collects system logs from various server systems and tweet information from an SNS to detect failures.
FIG. 2 is a block diagram showing the configuration of the failure detection device of the present invention.
FIG. 3 is a block diagram showing the configuration of the abnormality determination unit of the failure detection device.
FIG. 4 is an explanatory diagram of clustering and modeling the log information read by the monitoring device.
FIG. 5 is a model diagram showing an example of the occurrence of abnormal patterns in a server system monitored by the failure detection device.

An example embodiment of a monitoring system including a failure detection device that monitors monitored hosts for silent failures is described with reference to FIGS. 1 to 4.
As shown in FIG. 1, the failure detection device (monitoring server) 10 collects system log information from a plurality of server systems (monitored hosts) 1 to be monitored and also collects information from a social network (SNS) 2. It compares each with its normal model and, if both are abnormal, suspects that a silent failure has occurred. It then estimates whether a silent failure is present by checking against the result prediction model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs, and notifies the operator of each server system (monitored host) 1 of the failure detection result by issuing an alarm via e-mail, screen display, sound, or the like.

The failure detection device (monitoring server) 10 is a computer whose main components are a ROM storing basic programs, including a general-purpose operating system (OS) such as Linux (registered trademark), and various basic device information; a hard disk drive (HDD) storing various programs and data; a media drive that reads programs and data from storage media such as CD-ROMs and DVDs; a CPU that executes programs; a RAM that provides a work area for the CPU; and a parallel/serial interface for communicating with external devices. A monitoring program is stored on the HDD via a recording medium or the like, which enables the device to detect silent failures in the various server systems 1 being monitored.

As shown in FIG. 2, the failure detection device (monitoring server) 10 comprises an initial setting unit 11 that reads initial setting parameters and past models, an information collection unit 12 that reads multiple kinds of observation data, a data information processing unit 13 that processes the data that has been read, an abnormality determination unit 14, a failure estimation unit 15, a model generation unit 16, a timer management unit 17, and an alarm issuing unit 18.

The initial setting unit 11 reads the log transitions of system logs that existed in the past on the various server systems (monitored hosts) 1 as the normal state model, and reads the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs as the result prediction model.
The result prediction model consists of a distribution model of word appearance rates created by making predictions from past data, such as "if a new transition with this pattern appears, the word appearance rates in the logs over the preceding fixed period are expected to follow this distribution."

Concretely, the result prediction model X (word appearance distribution) is a word distribution model of the form
Word 1: 60%
Word 2: 50%
Word 3: 35%
...
Word N: ~%
that is, data consisting of the N words with the highest appearance rates.
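As an illustration of how such a top-N word distribution could be computed, here is a minimal Python sketch. The patent does not define how the appearance rate is calculated; this sketch assumes it is the share of log messages that contain a word (which is consistent with the listed rates summing to more than 100%). The function name, the whitespace tokenization, and the sample messages are all hypothetical.

```python
from collections import Counter


def word_appearance_distribution(log_messages, top_n=3):
    """Return the top_n words with the share of messages containing each.

    Assumption: "appearance rate" = fraction of messages containing the word.
    """
    if not log_messages:
        return []
    counts = Counter()
    for message in log_messages:
        # Count each word once per message so a rate never exceeds 100%.
        counts.update(set(message.lower().split()))
    total = len(log_messages)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / total) for word, count in ranked[:top_n]]


# Hypothetical log messages, for illustration only.
messages = [
    "connection timeout on port 443",
    "connection reset by peer",
    "disk latency warning",
    "connection timeout retry",
]
model_x = word_appearance_distribution(messages, top_n=2)
print(model_x)  # [('connection', 0.75), ('timeout', 0.5)]
```

A real deployment would likely need a tokenizer that strips timestamps, identifiers, and other variable fields before counting.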

When the result prediction model is compared with the distribution model of the actual system log and the actual distribution does not match the distribution predicted by the result prediction model, an unknown failure (silent failure) can be suspected.
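The comparison could be sketched as follows. Claim 5 names Mahalanobis distance, Z-test, and T-test as candidate analysis methods without fixing one; this sketch uses a per-word two-proportion z-test as one possible choice. The function names, the sample sizes, and the 1.96 cutoff (roughly a 5% two-sided significance level) are assumptions, not taken from the patent.

```python
import math


def z_score(p_expected, p_observed, n_expected, n_observed):
    """Two-proportion z statistic for one word's appearance rate."""
    pooled = (p_expected * n_expected + p_observed * n_observed) / (
        n_expected + n_observed
    )
    variance = pooled * (1 - pooled) * (1 / n_expected + 1 / n_observed)
    if variance == 0:
        return 0.0
    return (p_observed - p_expected) / math.sqrt(variance)


def deviating_words(predicted, observed, n_predicted, n_observed, z_limit=1.96):
    """List the words whose observed rate departs from the predicted rate."""
    result = []
    for word, p_expected in predicted.items():
        p_observed = observed.get(word, 0.0)
        z = z_score(p_expected, p_observed, n_predicted, n_observed)
        if abs(z) > z_limit:
            result.append(word)
    return result


# Hypothetical distributions: rates are shares of messages containing the word.
predicted = {"timeout": 0.60, "retry": 0.50, "disk": 0.35}
observed = {"timeout": 0.58, "retry": 0.20, "disk": 0.36}

suspicious = deviating_words(predicted, observed, n_predicted=200, n_observed=100)
print(suspicious)  # ['retry']
```

A non-empty result means the actual log did not follow the predicted distribution, which under the description above is grounds for suspecting a silent failure.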

If information on a failure state model generated in the past exists, the initial setting unit 11 reads it as the failure state model.
For example, the following items, recorded when a failure was determined in the past, are stored as the failure state model:
(1) the log transition patterns in the various server systems 1,
(2) a list of the number (rate) of negative posts in the SNS information 2, and the like,
(3) the word appearance distribution in the most recent log messages of the various server systems 1.
The information read by the initial setting unit 11 is shared in memory that can be read by the information collection unit 12, the data information processing unit 13, the abnormality determination unit 14, the failure estimation unit 15, and the model generation unit 16.
Initial setting parameters for the various settings used in failure detection are also read.

The information collection unit 12 reads the system log obtained by log transition monitoring on each server system (monitored host) 1, together with multiple kinds of observation data such as SNS information (tweets and the like), carrier call center information, user behavior information (Web access information, GPS information, and the like), and the number of accesses to a service.

The data information processing unit 13 processes the system log and observation data collected by the information collection unit 12 into a data format that can be compared with each of the models read by the initial setting unit 11: the normal state model, the failure state model, and the result prediction model.

The abnormality determination unit 14 determines an abnormality by comparing the system log (cause information source) with its normal state model (cause information source normal model), and the observation data (result information source) with its normal state model (result information source normal model).
To analyze the system log and determine whether an abnormality is present, the abnormality determination unit 14 comprises a log reading unit 21, a log clustering unit 22, and a log analysis unit 23, as shown in FIG. 3. The log reading unit 21 periodically reads, for each server system, the various log information (syslog) generated daily on each server system 1, via the information collection unit 12 and the data information processing unit 13.

The log clustering unit 22 classifies the unstructured log information (syslog) read for one server system (virtual host) 1 into categories. The log information (syslog) contains information such as the time, host name, and program name. It is classified into a plurality of categories according to the number and kinds of items that make up each log entry. For example, as in FIG. 4, log entries read from one server system (virtual host) are grouped into categories A, B, C, D, and so on, with entries of the same kind placed together according to the number and kinds of items (the entries differ in the omitted message part that follows "Aug 7 19:00:04 ... Vpxa").
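A minimal sketch of this kind of categorization is shown below. The patent only states that entries are grouped by the number and kinds of items; the specific signature used here (program name plus the number of tokens in the message part), the function name, and the sample syslog lines are assumptions made for illustration.

```python
def categorize(log_lines):
    """Assign a category label to each syslog line.

    Assumption: two lines are "the same kind" when they share the program
    name and the number of whitespace-separated tokens in the message.
    """
    labels = {}
    result = []
    for line in log_lines:
        # Classic syslog layout: month, day, time, host, then "program: message".
        _month, _day, _time, _host, rest = line.split(None, 4)
        program, _, message = rest.partition(":")
        signature = (program.strip(), len(message.split()))
        if signature not in labels:
            index = len(labels)
            # Label categories A, B, C, ... in order of first appearance.
            labels[signature] = chr(ord("A") + index) if index < 26 else f"CAT{index}"
        result.append(labels[signature])
    return result


# Hypothetical log lines; the message texts are invented for this example.
lines = [
    "Aug  7 19:00:04 host1 Vpxa: session opened for user",
    "Aug  7 19:00:05 host1 Vpxa: heartbeat ok",
    "Aug  7 19:00:06 host1 sshd: session opened for admin",
    "Aug  7 19:00:07 host1 Vpxa: session closed for user",
]
categories = categorize(lines)
print(categories)  # ['A', 'B', 'C', 'A']
```

The resulting label sequence is the input from which category-to-category transitions can be derived.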

The model generation unit 16 creates a transition model of log information under normal conditions from the log classification results. For example, in the case of FIG. 4, where the log information is classified into categories A, B, C, and D, the transitions "category A to B", "category A to D", "category B to C", and "category C to A" are registered (modeled) as the normal model of source and destination categories.
The transition model of log information under normal conditions is registered in advance as a plurality of transitions, for example from one month of log information containing no failures.
The device may also be configured so that transitions from failure-free log information are continually added to the normal model.
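Building the normal transition model from a sequence of category labels could look like the following sketch. The category sequence used here is hypothetical (the actual sequence in FIG. 4 is not reproduced in the text); it was chosen so that it yields the four transitions named above.

```python
def build_transition_model(category_sequence):
    """Collect every (source, destination) pair of consecutive categories."""
    return set(zip(category_sequence, category_sequence[1:]))


# Hypothetical failure-free category sequence.
normal_sequence = ["A", "B", "C", "A", "D"]
normal_model = build_transition_model(normal_sequence)
print(sorted(normal_model))
# [('A', 'B'), ('A', 'D'), ('B', 'C'), ('C', 'A')]
```

Using a set means repeated transitions are stored once, and adding transitions from new failure-free logs is a simple set union.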

The log analysis unit 23 compares the classification results of newly read log information with the transition model created by the model generation unit 16, and recognizes a log change when a log transition in the classification results is absent from the transition model. For example, suppose "category A to B", "category A to D", "category B to C", and "category C to A" are registered (modeled) as the normal model, and the transitions "category B to C", "category C to B", "category A to B", "category B to A", and "category C to A" are detected. Because "category C to B" and "category B to A" are not in the normal model, new transitions have been detected (anomaly detection through recognition of a log change).
When the log analysis unit 23 detects a log change (that is, when an abnormality is determined), it outputs a failure estimation instruction signal to the failure estimation unit 15. On receiving this signal, the failure estimation unit 15 estimates the failure from the multiple kinds of observation data and the result prediction model.
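The worked example in this passage can be reproduced with a short sketch. The function name and the representation of transitions as tuples are assumptions; the category values are the ones given in the text.

```python
def find_new_transitions(normal_model, observed_transitions):
    """Return observed transitions absent from the normal model, in order."""
    new_transitions = []
    for transition in observed_transitions:
        if transition not in normal_model and transition not in new_transitions:
            new_transitions.append(transition)
    return new_transitions


# Normal model and observed transitions taken from the example in the text.
normal_model = {("A", "B"), ("A", "D"), ("B", "C"), ("C", "A")}
observed = [("B", "C"), ("C", "B"), ("A", "B"), ("B", "A"), ("C", "A")]

new_transitions = find_new_transitions(normal_model, observed)
print(new_transitions)  # [('C', 'B'), ('B', 'A')]
```

A non-empty result corresponds to the log change that triggers the failure estimation instruction signal.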

That is, when the abnormality determination unit 14 determines an abnormality, the failure estimation unit 15 estimates the presence or absence of a silent failure by taking into account negative events indicated by the observation data and by comparing the word appearance distribution of the immediately preceding system log with the result prediction model described above.
The result prediction model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs, is read by the initial setting unit 11 in advance. New transitions arise in two patterns: (a) new transitions formed by a combination of existing categories, and (b) new transitions that include a new category.
In the example above, where "category A to B", "category A to D", "category B to C", and "category C to A" are registered (modeled) as the normal model, pattern (a) corresponds to "category C to B" and "category B to A", which are absent from the normal model.
Pattern (b) corresponds to any combination that includes a newly appearing category.
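The distinction between patterns (a) and (b) can be expressed as a small check. This is a sketch only; the function name and the return values "a" and "b" are hypothetical labels for the two patterns described above.

```python
def classify_new_transition(transition, known_categories):
    """Return "a" if both categories already exist, "b" if either is new."""
    source, destination = transition
    if source in known_categories and destination in known_categories:
        return "a"  # new combination of existing categories
    return "b"      # transition involving a newly appearing category


known = {"A", "B", "C", "D"}
print(classify_new_transition(("C", "B"), known))  # a
print(classify_new_transition(("A", "E"), known))  # b
```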

In this example, the result prediction model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs, is formed from word appearance distribution (appearance rate) information associated with the category components of new transitions. It may instead be formed from word appearance distribution (appearance rate) information associated with the number of detected new transitions.

The model generation unit 16 generates and holds (learns) the normal state model, and also generates and holds the result prediction model for which the failure estimation unit 15 estimated a silent failure.
Suppose the model generation unit 16 holds a result prediction model for which the failure estimation unit 15 estimated a silent failure, and the abnormality determination unit 14 determines that the log transition is the same as the log transition pattern of the failure state model. In that case, if the number (rate) of negative posts appearing in a given interval is at or above that of the failure state model, and the word appearance distribution in the most recently collected log messages is similar to the failure state model under a general distribution similarity calculation method, the failure estimation unit 15 unconditionally estimates a silent failure.
Alternatively, a failure may be estimated under an AND condition over the three items (1) to (3) described in paragraph 0031.

The timer management unit 17 manages a non-learning period, during which the model generation unit 16 performs no learning for a fixed period while the abnormality determination unit 14 observes new transitions in the system log.
That is, from the time the number of new transition pairs detected by log transition monitoring exceeds a threshold specified by the operator, no learning is performed for a period specified by the operator, and the device observes whether new transitions keep appearing every hour in numbers exceeding the threshold. This removes noise, and an inconspicuous but persistent state exceeding the threshold is watched closely and treated as an abnormality, so that the possibility of a silent failure is suspected.
As a result, by suspending learning for a fixed period and observing continuously rather than immediately treating the event as a failure, noise (abnormalities caused by temporary work and the like that are not failures) can be ignored. In addition, because the symptoms of a silent failure are often inconspicuous and worsen gradually over time, failures of this kind become easier to catch.
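The persistence check performed during the non-learning period might be sketched as follows. This is only an illustration under stated assumptions: new transition pairs are counted per hour, and both `threshold` and `period_hours` stand for the operator-specified values. The function name `persists_over_period` is hypothetical.

```python
def persists_over_period(hourly_new_transitions, threshold, period_hours):
    """Return True if the hourly count of new transition pairs exceeds
    `threshold` for `period_hours` consecutive hours.

    A single spike that does not persist is treated as noise and ignored.
    """
    run = 0  # number of consecutive hours above the threshold so far
    for count in hourly_new_transitions:
        run = run + 1 if count > threshold else 0
        if run >= period_hours:
            return True
    return False
```

A one-hour spike such as `[0, 5, 0, 0]` with a threshold of 3 is ignored, whereas `[0, 4, 5, 6, 1]` with a three-hour period is flagged for further checks.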

Next, the procedure for detecting a silent failure in the various server systems (monitored hosts) 1 using the failure detection device (monitoring server) 10 described above is explained with reference to FIG. 1.
In procedure 1, the number (rate) of negative posts appearing in the SNS information 2 is observed periodically, and the observation results are accumulated.
In procedure 2, normal-state logs from each server system 1 are read, and a normal state model of log transitions is created.
In procedure 3, logs are read periodically from each server system 1, and new transitions are monitored (observed to see whether they continue to occur over a fixed period).

In procedure 4, when new transitions are detected continuously and the number (rate) of negative posts appearing within the detection period exceeds the threshold, a failure is suspected on the basis of both abnormalities, and the process proceeds to the next procedure.
In procedure 5, the word appearance distribution associated with the detected new transition pattern (the result prediction model) is checked against the word appearance distribution in the most recent log messages.

When the result prediction model is checked against the most recent system log, for example when the applicable result prediction model is the result prediction model X (word appearance distribution) described above, it is determined whether the word distribution models differ. If they differ, an event that cannot be explained by the existing explanatory variables has been detected, and a silent failure is suspected.
Whether the result prediction model X (word appearance distribution) differs from the distribution model of word appearance rates in the immediately preceding log is determined, for example, by how closely the appearance rate of each word in the preceding log's distribution model matches (or deviates from) the appearance rate of that word in the result prediction model X. The degree of match (deviation) used as the criterion can be set in advance through input to the initial setting unit 11.
When the result prediction model is checked against the most recent system log, the comparison with the state of the immediately preceding system log may also be made with an analysis method such as the Mahalanobis distance, a Z-test, or a t-test.
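The per-word comparison described above could look roughly like the following Python sketch. It assumes whitespace tokenization and a single absolute tolerance on each word's appearance rate, neither of which the specification prescribes. The helper names `word_rates` and `distributions_differ` are hypothetical, and the statistical tests mentioned above are not implemented here.

```python
from collections import Counter


def word_rates(messages):
    """Compute the appearance rate of each word across log messages."""
    words = [w for m in messages for w in m.lower().split()]
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).items()}


def distributions_differ(model, observed, tolerance):
    """Return True if any word's appearance rate deviates from the model
    by more than `tolerance` (an assumed, pre-configured criterion)."""
    vocab = set(model) | set(observed)
    return any(
        abs(model.get(w, 0.0) - observed.get(w, 0.0)) > tolerance
        for w in vocab
    )
```

Here `model` plays the role of the result prediction model X and `observed` the distribution from the immediately preceding log; words absent from one side are treated as having a rate of zero.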

In procedure 6, when the most recently observed word appearance distribution is far from the result prediction model (that is, when the prediction accuracy was low), an unknown failure is suspected. Words that appeared unexpectedly, or that failed to appear, in the actual system log's distribution model are reported to the operator and can serve as reference information for investigating the cause of the silent failure.
In procedure 7, the result prediction model is learned by the model generation unit 16 as a known failure.
A state determined to be a failure may also be learned into the failure model by manual operation, on the operator's instruction, at the start of system operation or during operation.
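The operator notification in procedure 6 might be sketched as below, assuming both distributions are dictionaries mapping words to appearance rates. The function name `unexpected_words` and the `min_rate` cutoff are hypothetical illustration choices.

```python
def unexpected_words(model, observed, min_rate=0.0):
    """List words that appeared although the model did not predict them,
    and words the model predicted that did not appear.

    A word counts as present when its rate is above `min_rate`.
    """
    appeared = sorted(
        w for w, r in observed.items()
        if r > min_rate and model.get(w, 0.0) <= min_rate
    )
    missing = sorted(
        w for w, r in model.items()
        if r > min_rate and observed.get(w, 0.0) <= min_rate
    )
    return appeared, missing
```

The two returned lists correspond to the "appeared unexpectedly" and "failed to appear" words that would be passed to the operator as reference information.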

FIG. 5 shows an example in which new log transitions and the SNS information 2 (the appearance of negative posts) were monitored over a fixed period in order to detect a silent failure in the various server systems (monitored hosts) 1.
In FIG. 5, the number of failures detected in each system log, shown as a bar graph, is below the threshold at which a single occurrence would be suspected as a failure, but abnormalities below that threshold are occurring continuously (state A). If, during that period, the number (or rate) of negative posts is at or above its threshold (state B), a silent failure is suspected on the basis of state A and state B together.
That is, for state B, when the number (or rate) of negative words appearing in the posts of the SNS information 2 over a given period is below the threshold set by the operator, the state is regarded as normal; when it is at or above the threshold, the state is judged abnormal.
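The combination of state A and state B can be expressed as a small Python sketch. It rests on one interpretation of FIG. 5: state A holds when every interval in the period shows at least one abnormality but stays below the single-occurrence threshold. All function and parameter names are hypothetical.

```python
def is_state_a(anomaly_counts, single_shot_threshold):
    """State A: abnormalities occur in every interval, yet each count
    stays below the threshold that alone would indicate a failure."""
    return bool(anomaly_counts) and all(
        0 < c < single_shot_threshold for c in anomaly_counts
    )


def is_state_b(negative_posts, negative_threshold):
    """State B: negative posts in the period reach the operator's threshold."""
    return negative_posts >= negative_threshold


def suspect_from_states(anomaly_counts, single_shot_threshold,
                        negative_posts, negative_threshold):
    """Suspect a silent failure only when state A and state B both hold."""
    return (is_state_a(anomaly_counts, single_shot_threshold)
            and is_state_b(negative_posts, negative_threshold))
```

Requiring both states reflects the idea that neither a run of low-level log abnormalities nor a rise in negative posts is treated as sufficient evidence on its own.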

New transitions whose number exceeds the threshold may all be observed continuously, or only new transitions that include a specific keyword may be observed continuously.

In FIG. 5, state B of the SNS information 2 is illustrated with Twitter information (what is being posted), but the following may be considered as determination criteria instead of, or in addition to, it:
・ Call center information (what kinds of inquiries are being received from customers)
・ Web access logs (which websites are being accessed)
・ GPS information (where users are moving)
・ The number of accesses to the service
Because an abnormality can also be identified in each of these by defining its normal state, each can serve as one of the state determination elements for realizing the failure detection method of the present invention.
Which of the plurality of observation targets is used as observation data may be made specifiable by the operator at the start of system operation of the failure detection device (when the initial setting parameters are entered) or during operation.

The methods for distinguishing normal from abnormal for each of these are described below.
・ Call center information (what kinds of inquiries are being received from customers)
In the operators' notes and in voice data converted to text, the state is regarded as normal if the number (or rate) of negative words appearing in a given period is below the threshold set by the operator, and is judged abnormal if it is at or above the threshold.
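A minimal Python sketch of the negative-word threshold check follows. It assumes whitespace tokenization and a hand-made word list; the word list, function names, and threshold are all hypothetical. A real deployment on Japanese text would need proper word segmentation.

```python
def negative_word_count(texts, negative_words):
    """Count tokens in `texts` that belong to the negative-word list."""
    return sum(
        1 for text in texts for token in text.lower().split()
        if token in negative_words
    )


def is_abnormal(texts, negative_words, threshold):
    """Abnormal when the count in the period is at or above the threshold."""
    return negative_word_count(texts, negative_words) >= threshold
```

The same check applies unchanged to SNS posts or to search keywords on related home pages, since each case counts negative words in a period against an operator-set threshold.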

・ Web access logs (which websites are being accessed)
The access-count ranking of home pages related to the services provided using the observed system, and to the company providing those services, is modeled during normal operation. The number of accesses to each home page in a given period is then checked against the model to determine whether there is an abnormality.
For example, the number of accesses to each home page provided by KDDI is modeled during normal operation, and an abnormality is determined when accesses to a home page that is normally rarely visited suddenly increase.
An abnormality is also determined when, among the keywords searched within the related home pages, the number (or rate) of negative words in a given period is at or above the threshold set by the operator.

・ Location information (where users are moving)
Normal movement (the history of GPS data and the like) is modeled from the GPS information and similar data of terminals that connect to the services provided using the observed system. An abnormality is determined when the number of terminals that moved, in a given period, to positions not present in the model is at or above the threshold set by the operator.

・ The number of accesses to the service
The number of connection requests to the services provided using the observed system is modeled during normal operation. An abnormality is determined when the number of connection requests to the service in a given period is at or above the upper threshold, or at or below the lower threshold, set by the operator.
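The upper and lower threshold check on connection requests reduces to a simple band test, sketched below in Python. The function name and the example threshold values are hypothetical.

```python
def access_count_abnormal(request_count, lower, upper):
    """Abnormal when the count is at or below the lower threshold
    or at or above the upper threshold set by the operator."""
    return request_count <= lower or request_count >= upper
```

A two-sided band is used because a sharp drop in connection requests (users unable to reach the service) can be as telling as a surge.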

According to the failure detection device described above, whether a failure has occurred is judged by comparing the word appearance distribution of the immediately preceding system log with the result prediction model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs, and by analyzing a plurality of types of observation data. This improves the accuracy with which a detected abnormality is judged to be a silent failure.

Even when there is an abnormality in the log transitions of the system log, observing continuously rather than immediately treating it as a failure allows noise (abnormalities caused by temporary work and the like that are not failures) to be ignored. In addition, because the symptoms of a silent failure are often inconspicuous and worsen gradually over time, such failures become easier to catch.

Identifying the result prediction model and the co-occurring variables makes it possible to identify a new (unknown) cause that is correlated with the event. That is, an unknown state can be detected by finding an abnormal state that is far from what was predicted.

DESCRIPTION OF SYMBOLS: 1 ... Server system (monitored host), 2 ... SNS information, 10 ... Failure detection device (monitoring server), 11 ... Initial setting unit, 12 ... Information collection unit, 13 ... Data information processing unit, 14 ... Abnormality determination unit, 15 ... Failure estimation unit, 16 ... Model generation unit, 17 ... Timer management unit, 18 ... Alarm issuing unit, 21 ... Log reading unit, 22 ... Log clustering unit, 23 ... Log analysis unit.

Claims

A failure detection device that monitors the occurrence of a silent failure in a monitored host, comprising:
an abnormality determination unit that determines an abnormality by comparing a system log of the monitored host with a normal state model, which is the log transitions of system logs that existed in the past in the monitored host; and
a failure estimation unit that, when the abnormality determination unit determines an abnormality, takes into account a negative event based on at least one item of observation data among SNS information, call center information in the carrier, user behavior information, and the number of accesses to the service, and estimates a silent failure by comparing the word appearance distribution of the immediately preceding system log with a result prediction model, which is the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs.

A failure detection device that monitors the occurrence of a silent failure in a monitored host, comprising:
an initial setting unit that reads, as a normal state model, the log transitions of system logs that existed in the past in the monitored host, and reads, as a result prediction model, the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs;
an information collection unit that reads the system log obtained by log transition monitoring in the monitored host, and at least one item of observation data among SNS information, call center information in the carrier, user behavior information, and the number of accesses to the service;
a data information processing unit that processes the system log into a format that can be compared with each of the models;
an abnormality determination unit that determines an abnormality by comparing the system log with the normal state model;
a failure estimation unit that, when the abnormality determination unit determines an abnormality, takes into account a negative event based on the observation data and estimates a silent failure by comparing the word appearance distribution of the immediately preceding system log with the result prediction model;
a model generation unit that generates and holds the normal state model through learning, and generates and holds the result prediction model for which the failure estimation unit estimated a silent failure; and
a timer management unit that manages a non-learning period during which the model generation unit performs no learning for a fixed period while the abnormality determination unit observes new transitions in the system log.

A failure detection method for monitoring the occurrence of a silent failure in a monitored host, comprising:
a procedure of generating and holding a normal state model for each of the system log of the monitored host and at least one of SNS information, call center information in the carrier, user behavior information, and the number of accesses to the service, and observing whether an abnormality is present by combining a plurality of types of observation items; and
a procedure of holding, as a result prediction model, the word appearance distribution within a fixed time associated with the appearance of a new transition in past system logs of the monitored host and, when abnormalities are observed simultaneously in a plurality of observation items, determining the presence or absence of a silent failure by comparing the word appearance distribution of the immediately preceding system log with the result prediction model.

The failure detection method wherein the result prediction model is formed from word appearance distribution (appearance rate) information associated with the number of new transitions detected, or from word appearance distribution (appearance rate) information associated with the category components of a new transition.

The failure detection method according to claim 3, wherein the result prediction model is compared with the state of the immediately preceding system log by an analysis method such as Mahalanobis distance, Z test, or T test.

The failure detection method according to claim 3, wherein, from the time the number of new transition pairs detected by the log transition monitoring exceeds a threshold specified by the operator, learning is suspended for a period specified by the operator, and whether new transitions keep appearing every hour in numbers exceeding the threshold is observed, thereby removing noise, and an inconspicuous but persistent state exceeding the threshold is treated as an abnormality so that the possibility of a silent failure is suspected.

The fault detection method according to claim 6, wherein all new transitions exceeding the threshold are continuously observed.

The fault detection method according to claim 6, wherein new transitions exceeding the threshold are continuously observed only for new transitions including a specific keyword.

The failure detection method according to claim 3, wherein an operator can designate, at the start of system operation or during operation, which of a plurality of observation targets is used.

The failure detection method according to claim 3, wherein a state determined to be a failure is stored as a failure state model, and when a similar state is subsequently detected, information extracted from the past failure state model can be presented as a reference value.

The failure detection method according to claim 3, wherein the failure model can be made to learn a state determined to be a failure at the start of system operation or during operation according to an instruction from an operator.

A failure detection program capable of executing, by a computer, each procedure of the failure detection method for a monitored host according to claim 3.

A computer-readable recording medium in which the failure detection program according to claim 12 is stored.