JP2013077194A

JP2013077194A - Information system device utilizing knowledge

Info

Publication number: JP2013077194A
Application number: JP2011217183A
Authority: JP
Inventors: Hiroshi Sugimura; 博杉村
Original assignee: Individual
Current assignee: Individual
Priority date: 2011-09-30
Filing date: 2011-09-30
Publication date: 2013-04-25

Abstract

PROBLEM TO BE SOLVED: To provide a general-purpose device configured to operate without using the knowledge of a user for performing new knowledge discovery with respect to time-sequential data to be analyzed by installing a function for outputting partial shapes for predicting the classification of the time-sequential data or its combination.SOLUTION: An information system device includes four functions, that are, a feature pattern discovery function, a classification rule extraction function, a classification prediction function, and a feature pattern improvement function. The shapes of time-sequential data as the features of each classification are discovered by the feature pattern discovery function, and a classification rule using the feature pattern is extracted by the classification rule extraction function. The shape of the feature pattern is improved by the feature pattern improvement function, and the classification rule whose quantity is high is extracted by using the improved pattern. The classification rule extracted in this way is utilized by inputting unclassified data, and the classification of the time-sequential data is predicted.

Description

本発明は時系列データの分類を予測するための知識を自動的に発見し、蓄積、活用するための情報システムの装置に関する。 The present invention relates to an information system apparatus for automatically discovering, accumulating and utilizing knowledge for predicting classification of time series data.

データマイニングによってデータベース内の大量のデータから発見的な知識獲得が可能となった。本発明で扱う時系列データとは時間の経過にそって記録した数値データのシーケンスで、データマイニング技術を適用することで、新たな知識の発掘を目的としている。実社会においてこのような形式のデータは様々な分野で頻出し、これらを解析することで、突然の降水量の急増、大地震の発生、株価の暴落などの変動が予測できると考えられ、重要なテーマとなっている。 Data mining enables heuristic knowledge acquisition from a large amount of data in the database. The time-series data handled in the present invention is a sequence of numerical data recorded along with the passage of time, and aims to discover new knowledge by applying a data mining technique. In the real world, this type of data appears frequently in various fields, and by analyzing these data, it is expected to be able to predict fluctuations such as a sudden increase in precipitation, the occurrence of a large earthquake, and a fall in stock prices. It has become a theme.

非特許文書１では時系列データからの効率的な頻出パタンの発見方法を提案しており、この手法は頻出パタンを特徴的なパタンとしている。しかし、単純に頻出パタンを抽出する手法は多くの解析者にとって既知であるか、どのようなデータにおいても共通的な形を抽出する場合が多く、興味深いパタンを発見できない場合が多い。本発明ではデータ間の違いを表すためのパタンの抽出を目指す。 Non-Patent Document 1 proposes an efficient frequent pattern finding method from time-series data, and this method uses the frequent pattern as a characteristic pattern. However, a method of simply extracting a frequent pattern is known to many analysts, or a common form is often extracted from any data, and an interesting pattern cannot often be found. The present invention aims to extract patterns for representing differences between data.

非特許文書２では, 抽出すべき時系列データの部分的な形状をユーザが指定することによって解析を行う。ユーザの知識を用いることによって、解析者の興味に従ったパタンを抽出できるが、解析者の想像しない形状を獲得するためには多くの試行錯誤を要する。さらに、このように背景知識に依存した手法は知識のないユーザにとっては使いづらい手法といえる。 In non-patent document 2, analysis is performed by the user specifying a partial shape of time-series data to be extracted. By using the user's knowledge, it is possible to extract a pattern according to the interest of the analyst, but it takes a lot of trial and error to obtain a shape that the analyst does not imagine. Furthermore, it can be said that such a method that depends on background knowledge is difficult to use for a user without knowledge.

非特許文書３では株価データに対して、さまざまな株価データに付随する18種類の指標をクラスタリングすることによって決定木学習に使用する属性を抽出している。この手法で作られた決定木とトレーニングデータから、未来予測に有効なパタンを抽出できると考えられるが、この手法は株価の特徴に依存し、汎用的な解析手法といえない。また株価データと違い、気温や湿度のように多くの指標がない時系列データに対して適用できない。 In Non-Patent Document 3, the attributes used for decision tree learning are extracted by clustering 18 kinds of indices attached to various stock price data with respect to the stock price data. Although it is thought that effective patterns for future prediction can be extracted from decision trees and training data created by this method, this method depends on the characteristics of stock prices and cannot be said to be a general-purpose analysis method. Unlike stock price data, it cannot be applied to time-series data that does not have many indicators such as temperature and humidity.

Agrawal, R. and Srikant, R.: Mining sequential patterns, in Data Engineering, 1995. Proceedings of the Eleventh International Conference on, pp. 3-14, IEEE (2002)Agrawal, R. and Srikant, R .: Mining sequential patterns, in Data Engineering, 1995.Proceedings of the Eleventh International Conference on, pp. 3-14, IEEE (2002) Haigh, K., Foslien, W., and Guralnik, V.: Visula Query Language: Finding patterns in and relationships among time series data, in Proceedings of the seventh Workshop on Mining Scientific and Engineering Datasets, Citeseer (2004)Haigh, K., Foslien, W., and Guralnik, V .: Visula Query Language: Finding patterns in and relationships among time series data, in Proceedings of the seventh Workshop on Mining Scientific and Engineering Datasets, Citeseer (2004) Abe, H., Hirabayashi, S., Ohsaki, M., and Yamaguchi, T.: Evaluating a Trading Rule Mining Method based on Temporal Pattern Extraction, in The Third International Workshop on Mining Complex Data (MCD2007) In Conjunction with ECML/PKDD 2007, pp. 49-58 (2007)Abe, H., Hirabayashi, S., Ohsaki, M., and Yamaguchi, T .: Evaluating a Trading Rule Mining Method based on Temporal Pattern Extraction, in The Third International Workshop on Mining Complex Data (MCD2007) In Conjunction with ECML / PKDD 2007, pp. 49-58 (2007)

本発明は、上記従来技術の抱える問題に鑑みてなされたものである、時系列データの分類を予測するための部分的な形状やその組み合わせを出力する機能を有することによって、ユーザの知識を使わずに動作し解析対象の時系列データに対して新たな知識発見が行える、汎用的な装置を提供することを目的とする。 The present invention has been made in view of the above-described problems of the prior art, and has a function of outputting partial shapes and combinations thereof for predicting classification of time series data, thereby using user knowledge. It is an object to provide a general-purpose device that operates without any problem and can discover new knowledge for time-series data to be analyzed.

装置はまず時系列データから特徴パタンを発見する。時系列データに頻出する形状はそのデータを解析する際に重要であることは明白であるが、その一方で、複数の分類にわたって広く出現する形状はそれらを特定するためには役に立たない。本発明は時系列データの分類を目的として、分類間を特徴づけるパタン発見のためにTF*IDFを時系列データに対して適用する。 The device first finds a feature pattern from the time series data. Obviously, shapes that appear frequently in time series data are important in analyzing the data, while shapes that appear widely across multiple categories are useless to identify them. The present invention applies TF * IDF to time-series data for the purpose of classifying time-series data in order to find a pattern that characterizes the classification.

TF*IDFはテキストマイニングの分野で用いられており、その値は単語の重要度の統計的な計測値として評価される。1つの文書は単語の集合であり、データベースは文書の集合である。ある単語の重要度は1つの文書内で頻出するほど増加し、データベースでその単語が出現する文書数が多いほど減少する。通常、単語の出現頻度TFは文書tjでの単語wiの出現確率とする。特定の文書に出現する単語はすべての文書に出現する単語よりも重要度が高い。これを逆文書頻度といい、Nを文書数、nを単語wiが少なくとも1つ含まれている文書の個数とするとき、TF*IDFは下記の数式１で定義される。
TF * IDF is used in the field of text mining, and its value is evaluated as a statistical measure of word importance. One document is a set of words, and the database is a set of documents. The importance of a word increases as it appears frequently in one document, and decreases as the number of documents in which the word appears in the database increases. Usually, the word appearance frequency TF is the appearance probability of the word wi in the document tj. Words that appear in a particular document are more important than words that appear in all documents. This is called reverse document frequency, where N is the number of documents and n is the number of documents including at least one word wi, TF * IDF is defined by the following Equation 1.

本発明はこのTF*IDFの概念を拡張して時系列データに適用する。１つの時系列データを１つの文書とみなし、時系列データ中の一部分を部分時系列データと呼び、１つの単語として考える。この時、単純に時系列データから得られる固定長の部分を抽出すればその総数はとても多くなるため、代表的な部分時系列データの形状を発見する必要がある。この目的のために、提案装置は抽出された部分時系列データに対してクラスタリングを行う。時系列データに対してクラスタリングを行うことについては議論の余地があるが、抽出された多くの部分時系列データからいくつかの代表的な部分時系列データ、またはそれを示す形状を発見できればどのような手法を用いてもよく、本発明の本質ではないため言及しない。 The present invention extends this concept of TF * IDF and applies it to time series data. One time series data is regarded as one document, and a part of the time series data is called partial time series data and is considered as one word. At this time, if a fixed-length portion obtained simply from time-series data is simply extracted, the total number thereof becomes very large. Therefore, it is necessary to find a representative partial time-series data shape. For this purpose, the proposed device performs clustering on the extracted partial time series data. Although clustering of time series data is controversial, what if we can find some representative partial time series data or a shape that represents it from many extracted partial time series data? Such a method may be used and is not mentioned because it is not the essence of the present invention.

クラスタリングには、一般的な手法の１つであるk-meansを用いる。Mクラスタリングは与えられたデータセットに対して互いに素なM個のサブセット（クラスタ）に分ける事を目的として、クラスタリングの基準を最適化する。クラスタリングの基準は各データxと、xiを含むサブセットXiであるところのクラスタの中心（セントロイド）ciとの距離である。k個のセントロイドをc1, ..., ckとして、D(p,q)をデータpとqの距離を計算する関数としたとき、クラスタリングの基準Errは下記の数式２で定義される。
For clustering, k-means which is one of general methods is used. M clustering optimizes clustering criteria for the purpose of dividing a given data set into disjoint M subsets (clusters). The standard of clustering is the distance between each data x and the center (centroid) ci of the cluster which is a subset Xi including xi. When k centroids are c1,..., ck and D (p, q) is a function for calculating the distance between the data p and q, the clustering reference Err is defined by the following Equation 2.

距離関数D(p,q)として最も広く使われる手法はユークリッド距離かその拡張である。図１にユークリッド距離の計算を行うための２つの時系列データのポイントの対応例を示す。図１の（ａ）の通り、ユークリッド距離は、同じ長さの時系列データのペアにしか適用できないうえに、人間の直感に反する結果を生じてしまう場合がある。人間は時系列データの形を柔軟に認識できるのに対し、この方法では時間方向の対応が固定化されるためと考えられる。 The most widely used method for the distance function D (p, q) is Euclidean distance or its extension. FIG. 1 shows a correspondence example of two time-series data points for calculating the Euclidean distance. As shown in FIG. 1A, the Euclidean distance can be applied only to a pair of time-series data having the same length, and may cause a result contrary to human intuition. This is probably because humans can flexibly recognize the form of time-series data, but this method fixes the correspondence in the time direction.

Dynamic Time Warping（以降、DTW）では2つの時系列データのポイント間を柔軟に対応づけ、人間の直感にあう距離計算を行う。時系列データ間の最適な類似度を計算するために、横軸と縦軸を拡縮することによって累積距離を最小化することで適切に対応付ける。図１の（ｂ）にこの対応例を示す。 Dynamic Time Warping (hereinafter DTW) flexibly associates two time-series data points and calculates distances that match human intuition. In order to calculate the optimum degree of similarity between time series data, the horizontal distance and the vertical axis are scaled to minimize the cumulative distance and appropriately correspond. FIG. 1B shows an example of this correspondence.

長さiとjを持つ2つの時系列データA、B間のDTW距離g(A,B)は下記の数式３で定義される。A(i-1)とB(j-1)はA(i)とB(j)の最後の値を取り除いたシーケンスである。数式３のqとrはシーケンスの拡縮に伴うコストであり、sは値の不一致によるコストである。類似しているペアは値が小さく、完全に一致する2つのシーケンスは0になる。
The DTW distance g (A, B) between the two time series data A and B having the lengths i and j is defined by the following Equation 3. A (i-1) and B (j-1) are sequences obtained by removing the last values of A (i) and B (j). In Equation 3, q and r are costs associated with the expansion / contraction of the sequence, and s is a cost due to mismatch of values. Similar pairs have small values, and two sequences that match exactly will be zero.

分類ルールの抽出は特徴発見機能によって獲得された特徴パタンを用いて行われる。分類ルールの抽出には決定木学習を用いる。決定木学習は教師あり学習の一種で、トレーニングデータを入力することによって属性を枝とした木構造の意思決定モデルを作成する。図２の（ｃ）にトレーニングデータの例を示す。P1からP3は特徴発見機能で発見された特徴パタンである。S1からS4はデータベース中のサブシーケンスである。全てのサブシーケンスと各特徴パタンとの距離についてDTWを用いて計算して属性値とする。トレーニングデータの各インスタンスはクラスラベルに関連付けられている。これらの距離とクラスラベルの組による1つの表がトレーニングデータである。このトレーニングデータを用いて決定木学習を行った結果として、図２の（ｄ）に示すような知識を獲得できる。 The classification rule is extracted using the feature pattern acquired by the feature finding function. Decision tree learning is used to extract classification rules. Decision tree learning is a type of supervised learning, in which a decision model with a tree structure with attributes as branches is created by inputting training data. An example of training data is shown in FIG. P1 to P3 are feature patterns discovered by the feature discovery function. S1 to S4 are subsequences in the database. The distance between all the sub-sequences and each feature pattern is calculated using the DTW and set as an attribute value. Each instance of training data is associated with a class label. One table with these distance and class label pairs is the training data. As a result of performing decision tree learning using this training data, knowledge as shown in FIG. 2D can be acquired.

決定木学習を行った後の装置に対して未分類の時系列データを入力すると、分類ルールに従ってその分類を予測する。予測は次の手順で行われる。
１．分類ルールに用いられているすべての特徴パタンとの距離についてDTWを用いて計算する。
２．分類ルールの根ノードを調査ノードとする。
３．調査ノードと未分類データとの比較をおこない、調査ノードで示された条件にしたがって次のノードに移動する。
４．移動先が枝の場合は現在のノードを調査ノードとして、３から繰り返し実行する。
５．葉の場合にはそのクラス値を出力する。 When unclassified time-series data is input to the device after the decision tree learning, the classification is predicted according to the classification rule. The prediction is performed according to the following procedure.
1. The distance from all feature patterns used in the classification rule is calculated using DTW.
2. Let the root node of the classification rule be the survey node.
3. The survey node and the unclassified data are compared, and the next node is moved according to the condition indicated by the survey node.
4). When the movement destination is a branch, the current node is set as the investigation node, and the process is repeatedly executed from 3.
5. If it is a leaf, its class value is output.

本装置は、発見された特徴パタンを改良することによってさらに高い品質の分類ルールを提供するための機能を有する。特徴パタンを改良する手法については山登り法などの探索手法やQ-learningなどの強化学習などのいくつかの手法が考えられるが、単純な探索や強化学習ではすぐに局所解に陥ってしまうためにこれを避ける必要がある。このため、本研究では遺伝的アルゴリズム（GA）を用いて改良を行う。GAは自然淘汰と遺伝という進化論のアイデアに基づいた発見的な探索アルゴリズムである。遺伝的アルゴリズムは多様で複雑な評価関数から近似解を求める場合に有効な方法で、最適化問題や探索問題などに適用されている。決定木の枝や葉を染色体として符号化して遺伝的操作である交差や突然変異を適用する手法があるが本発明は予測のための特徴的な時系列データの形状の発見も重要な目的である。従って、特徴パタンを改良することによって決定木の品質を高める手法を開発した。装置は最終的に、学習した決定木だけでなく、それに用いた改良後の特徴パタンもセットにして出力する。 The device has the capability to provide higher quality classification rules by improving the discovered feature patterns. There are several methods for improving the feature pattern, such as search methods such as hill-climbing and reinforcement learning such as Q-learning. However, simple search and reinforcement learning quickly fall into local solutions. It is necessary to avoid this. For this reason, in this study, improvements are made using genetic algorithms (GA). GA is a heuristic search algorithm based on the evolutionary idea of natural selection and heredity. The genetic algorithm is an effective method for obtaining an approximate solution from various and complex evaluation functions, and is applied to an optimization problem and a search problem. There is a technique to encode branches and leaves of decision trees as chromosomes and apply genetic operations such as intersection and mutation, but the present invention also has an important purpose of finding the shape of characteristic time-series data for prediction. is there. Therefore, we developed a technique to improve the quality of decision trees by improving the feature pattern. Finally, the apparatus outputs not only the learned decision tree but also the improved feature pattern used for it as a set.

特徴パタンを改良する機能の動作概要は次のようになる。
１．特徴抽出機能によって得られたパタンを、初期特徴パタンとする。
２．特徴パタンを用いて決定木を作成する。
３．作成した決定木を評価する。
４．停止条件を充たしていたら、作成した決定木とそれに用いた特徴パタンを出力する。
５．停止条件を充たしていなければ、GAを使って特徴パタンを改良する。
６．改良した特徴パタンをもとに２から処理を繰り返す。 The operation outline of the function for improving the feature pattern is as follows.
1. The pattern obtained by the feature extraction function is set as the initial feature pattern.
2. A decision tree is created using feature patterns.
3. Evaluate the created decision tree.
4). If the stop condition is satisfied, the created decision tree and the feature pattern used for it are output.
5. If the stop condition is not met, use GA to improve the feature pattern.
6). The process is repeated from 2 based on the improved feature pattern.

GAを用いた特徴パタンの改良方法の説明は、まず特徴パタンとGAにおける染色体との対応について説明し、染色体の評価、交叉、そして突然変異について説明する。 In the explanation of the method for improving the feature pattern using GA, first, the correspondence between the feature pattern and the chromosome in GA will be explained, and the evaluation, crossover, and mutation of the chromosome will be explained.

1つの染色体が1つの特徴パタンに対応しており、染色体の操作は特徴パタンの操作と同等である。特徴パタンの数値を変化率に正規化した値のシーケンスを染色体とする。つまり、個体は実数値から構成される長さnのシーケンスC = (x1, … xn )となる。図３に1つの特徴パタンに対応する染色体の例を示す。扱う遺伝子は離散的な変数ではなく実数値であるため、単純な0と1のビット列を扱う手法ではなく、実数値GAによって遺伝子改良を行う。 One chromosome corresponds to one feature pattern, and the manipulation of the chromosome is equivalent to the manipulation of the feature pattern. A sequence of values obtained by normalizing the numerical value of the feature pattern to the rate of change is defined as a chromosome. That is, an individual is a sequence C = (x1,... Xn) of length n composed of real values. FIG. 3 shows an example of a chromosome corresponding to one feature pattern. Since the gene to be handled is not a discrete variable but a real value, it is not a method of handling a simple bit string of 0 and 1, but gene improvement is performed by a real value GA.

染色体の評価値には、その染色体と対応する特徴パタンを用いたDTW距離によってトレーニングデータを分類した際の獲得情報量を用いる。獲得情報量Gain(X)は次の数式４と数式５で計算できる。ここで、Tは分割前の集合であり、ある分割によってn個の部分集合Ti（iは1より大きくn以下）個に分割されるものとする。info (S)は情報エントロピーと呼ばれる量で次の数式６で定義される。 Sは集合を示し、 freq(Cj,S)はクラスCjに属するSの要素数、kはクラス数である。
For the evaluation value of the chromosome, the amount of information acquired when the training data is classified by the DTW distance using the feature pattern corresponding to the chromosome is used. The acquired information amount Gain (X) can be calculated by the following equations 4 and 5. Here, T is a set before division, and is divided into n subsets Ti (i is greater than 1 and less than or equal to n) by a certain division. info (S) is an amount called information entropy and is defined by the following Equation 6. S indicates a set, freq (Cj, S) is the number of elements of S belonging to class Cj, and k is the number of classes.

選択ではルーレット選択を行う。この方法では適応度の高い染色体ほど選ばれる確率が大きくなるが、適応度の低い染色体でも次世代残される可能性があり、適応度が高い染色体だけが残ることによって局所解にとらわれやすくなるのを防ぐ。ルーレット選択は次の手順で行う。
１．各染色体の評価値を基にして、ルーレットのスケーリングを行う。
２．ルーレット選択によって選択した染色体を残す。
３．手順２を繰り返し、あらかじめ定めておいた数になるまで次世代に残す染色体を選択する。 In selection, roulette selection is performed. In this method, the higher the fitness of the chromosome, the greater the probability of being selected, but there is a possibility that the next generation will remain even in the low fitness chromosome, and it is easy to be caught by the local solution by leaving only the high fitness chromosome. prevent. The roulette selection is performed according to the following procedure.
1. Roulette scaling is performed based on the evaluation value of each chromosome.
2. Leave the chromosome selected by roulette selection.
3. Repeat step 2 and select the chromosomes that will remain in the next generation until a predetermined number is reached.

実数値GAでは単純な交叉における限界として、図４のように2つの親X、Yの交叉による子X’、Y’が最適解Eに近づくとは限らないという問題があるため、本装置では平均交叉法を用いる。平均交叉法は、探索領域において2つの探索点の内分、外分となる点を生成する手法であり、2つの染色体ペアに対応する実数値ベクトルをX、Yとすると、それらの中心(X+Y)/2と、2つの外分点(3X-Y)/2および(-X+3Y)/2を生成し、それらのうち適応度の高い２つの染色体を選択する。交差点は二点交叉によって選択する。選択された2点の内部について、平均交叉法によって交叉を行う。図５に平均交叉法と二点交叉を組み合わせた交叉方法の概要を示す。 The real value GA has a problem in that simple crossover has a problem that the children X 'and Y' due to crossover of two parents X and Y do not always approach the optimal solution E as shown in Fig. 4. Use the average crossover method. The average crossover method is a method of generating points that are the inner and outer parts of two search points in the search region.If the real-valued vectors corresponding to two chromosome pairs are X and Y, their center (X + Y) / 2 and two external dividing points (3X-Y) / 2 and (-X + 3Y) / 2 are generated, and two chromosomes having high fitness are selected. The intersection is selected by two-point intersection. Crossover is performed by the average crossover method for the inside of the two selected points. FIG. 5 shows an outline of the crossover method combining the average crossover method and the two-point crossover method.

突然変異では摂動幅Mを定義し、染色体からランダムに選択した遺伝子xiを最大Mの値だけ増減させる。染色体の遺伝子数をn、変化する遺伝子をxi、xiの上限をa、下限をbとすると次のようにして突然変異を行う。
１．0以上n以下となるiを選択する。
２．正負の方向をそれぞれ0.5の確率で選択する。
３．正方向の時、上限をaとしてxiからxi + Mまで値をランダムに選択する。
４．負方向の時、下限をbとしてxi - Mからxiまでの値をランダムに選択する。 In the mutation, the perturbation width M is defined, and the gene xi randomly selected from the chromosome is increased or decreased by the maximum M value. When the number of chromosome genes is n, the changing genes are xi, the upper limit of xi is a, and the lower limit is b, mutation is performed as follows.
1. Select i that is greater than or equal to 0 and less than or equal to n.
2. Positive and negative directions are selected with a probability of 0.5 each.
3. In the positive direction, the upper limit is a, and a value is randomly selected from xi to xi + M.
4). In the negative direction, the value from xi-M to xi is selected at random with the lower limit as b.

上述した通り、分類された時系列データの特徴パタンと、それを用いた分類ルール、または自動的に改良された特徴パタンとそれを用いた分類ルールを抽出するとともに、抽出した分類ルールを用いて未分類の時系列データの分類を予測する機能が提供される。 As described above, the feature pattern of the classified time series data and the classification rule using the same, or the automatically improved feature pattern and the classification rule using the same are extracted, and the extracted classification rule is used. A function for predicting classification of unclassified time series data is provided.

ユークリッド距離とDTW距離の違い。Difference between Euclidean distance and DTW distance. トレーニングデータの例と抽出される決定木。Training data example and extracted decision tree. 特徴パタンと染色体の対応例。Example of correspondence between feature pattern and chromosome. 単純な交叉の問題。A simple crossover problem. 二点交叉と平均交叉。Two-point crossover and average crossover. 装置全体。The entire device. 理想的な実験データ作成方法。Ideal method for creating experimental data. 遺伝的アルゴリズムによるパタン改良の効果。Effect of pattern improvement by genetic algorithm.

分類ごとの時系列データの形状とそれを用いたルールを抽出するという目的のために、実際に装置を作成して実現した。提案装置の効果について（１）特徴パタンに基づいた決定木生成は有効か？（２）有効な特徴パタンは抽出可能か？（３）改良機能はどの程度有効か？の3点について調査を行う。このために2種類の実験データを用いる。提案装置の想定する理想的なデータを人工的に作成することによって効果（１）を調査する。次に実際の医療データを用いて効果（２）、（３）について調査する。 For the purpose of extracting the shape of time series data for each classification and the rules using it, the device was actually created and realized. (1) Is decision tree generation based on feature patterns effective? (2) Can effective feature patterns be extracted? (3) How effective is the improved function? Investigate three points. Two types of experimental data are used for this purpose. The effect (1) is investigated by artificially creating ideal data assumed by the proposed apparatus. Next, effects (2) and (3) are investigated using actual medical data.

図６は装置全体の概要を示す。1社の株価、1人の患者などを単位とした1つの時系列データは、クラス値に関連付けられて保存されている。まず各時系列データに対して、すべての部分時系列データのセットを固定長のスライドウィンドウによって収集する。収集された全ての部分時系列データを用いてクラスタリングを行う。次に、いくつかの代表シーケンスはTF*IDFによってフィルターされ、残ったシーケンスが特徴パタンとなる。各部分時系列データは特徴パタンとの距離を用いて表現される。これが決定木学習を使って特徴抽出する際のトレーニングデータとなる。 FIG. 6 shows an outline of the entire apparatus. One piece of time-series data based on the stock price of one company, one patient, etc. is stored in association with the class value. First, for each time series data, a set of all partial time series data is collected by a fixed-length sliding window. Clustering is performed using all the collected partial time series data. Next, some representative sequences are filtered by TF * IDF, and the remaining sequences become feature patterns. Each partial time series data is expressed using a distance from the feature pattern. This is training data when extracting features using decision tree learning.

特徴パタン抽出の際には部分時系列データ抽出のスライドウィンドウのサイズは20、スライド量を5とする。このとき、データの末端のスライドウィンドウのサイズに満たないデータは切り捨てる。DTWでは水平コスト（時間軸のずれのコスト）を50.0、垂直コスト（計測値のずれ）のコストを値の差の絶対値とした。決定木学習にはC4.5を用いた。パタン改良の効果を調査するために、遺伝的アルゴリズムで100世代分の計算し、各世代における予測精度を求める。決定木の分類精度は10分割交差検定によって測定する。 When extracting feature patterns, the size of the slide window for partial time series data extraction is 20 and the slide amount is 5. At this time, data less than the size of the sliding window at the end of the data is discarded. In DTW, the horizontal cost (cost of time axis deviation) is 50.0, and the cost of vertical cost (measurement value deviation) is the absolute value of the difference in values. C4.5 was used for decision tree learning. In order to investigate the effect of pattern improvement, 100 generations are calculated using a genetic algorithm, and the prediction accuracy in each generation is obtained. The classification accuracy of the decision tree is measured by 10-fold cross validation.

まず、理想的な実験データの作成方法について説明する。まず優先度と未来状態と特徴パタンの組をあらかじめ5種類用意する。それらの特徴パタンと未来状態を設定していないランダムなパタン５種類をランダムに配置することで時系列データを作成する。図７に理想的な実験データの作成方法の概要を示す。作成した時系列データを20種類用意して実験を行う。停滞幅倍率は5%とした。 First, an ideal experimental data creation method will be described. First, five types of combinations of priority, future state, and feature pattern are prepared in advance. Time-series data is created by randomly arranging five types of feature patterns and random patterns for which future states are not set. FIG. 7 shows an outline of a method for creating ideal experimental data. Prepare 20 types of created time-series data and conduct experiments. The stagnation width magnification was 5%.

作成した時系列データから、単純に部分時系列データの各時刻の値を属性とした場合と、テストデータ作成に用いた特徴パタンを用いた場合と、ランダム生成した特徴パタンを用いた場合の3種類で決定木学習を行い比較する。 From the created time-series data, the time value of partial time-series data is simply used as an attribute, the feature pattern used for test data creation, and the randomly generated feature pattern are used. Perform decision tree learning by type and compare.

精度と分散、そして決定木サイズの実験結果を表１に示す。Simpleは、単純に部分時系列データの各時刻の値を属性とした決定木である。Known patternsは、テストデータ作成に用いた特徴パタンを属性とした決定木である。GA ＋ Known patternsは、テストデータ作成に用いた特徴パタンを属性としたあと遺伝的アルゴリズムによって改良した決定木である。Random patternsは、ランダム生成した特徴パタンを属性とした決定木である。GA ＋ Random patternsは、ランダム生成した特徴パタンを属性としたあと、遺伝的アルゴリズムによって改良した決定木である。特徴パタンを用いた手法では遺伝子の数を5、10、20と変えて測定する。表中の数値はすべて20種類のデータの平均値である。獲得した決定木の安定性を調べるために予測精度の分散も示す。
Table 1 shows the experimental results of accuracy, variance, and decision tree size. Simple is a decision tree whose attribute is simply the value of each time of partial time-series data. Known patterns is a decision tree with the feature pattern used for test data creation as an attribute. GA + Known patterns is a decision tree that has been improved by a genetic algorithm after using the feature pattern used for test data creation as an attribute. Random patterns are decision trees that have randomly generated feature patterns as attributes. GA + Random patterns are decision trees that are improved by a genetic algorithm after using randomly generated feature patterns as attributes. In the method using the characteristic pattern, the number of genes is changed to 5, 10, and 20 for measurement. All numbers in the table are average values of 20 types of data. We also show the variance of prediction accuracy to investigate the stability of the acquired decision tree.

表１から、SimpleとRandom patternsを比較するとそれほど予測精度に違いがなく、クラスタの詳細度が上がるごとに多少の精度向上があることが分かる。ただし、分散という点で従来手法よりもパタンを用いたほうが安定していることが分かる。また、Known patternsに関しては全ての設定で従来手法を上回る予測精度を得られた。こちらの実験でもクラスタの詳細度が上がるほど予測精度が上がることが分かる。 From Table 1, it can be seen that there is not much difference in prediction accuracy when comparing Simple and Random patterns, and there is some improvement in accuracy as the level of detail of the cluster increases. However, it can be seen that using the pattern is more stable than the conventional method in terms of dispersion. In addition, with regard to Known patterns, prediction accuracy exceeding the conventional method was obtained in all settings. This experiment also shows that the prediction accuracy increases as the level of detail of the cluster increases.

GAを行う前と後では予測精度が数ポイント向上しており、大きな効果ではないが少なからず予測精度を高める効果があることが分かる。以下の実際の医療データを用いた実施結果でGAについての詳細な考察を行うが、人工的に作られたデータでは特徴パタン抽出によって高い予測精度を獲得できたため、GAによる精度向上が伸び悩んだと考えられる。 Before and after the GA, the prediction accuracy has improved by several points, which is not a big effect, but it can be seen that it has the effect of improving the prediction accuracy. The detailed results of GA using the following actual medical data will be considered in detail. However, with artificially generated data, high prediction accuracy could be obtained by extracting feature patterns, so the improvement in accuracy by GA was sluggish. Conceivable.

提案した装置の有効性をさらに調査するために、実際の医療データを用いて実験を行った。使用するデータは千葉大学付属病院から提供されている慢性肝炎の検査データで、このデータは1982年から2001年までの20年間で771名についての検査の記録である。このデータから血小板の時系列データ（PLT）のみを用いて慢性肝炎のタイプを予測する。慢性肝炎のタイプにはＢ型とＣ型の2種類があるが、Ｃ型肝炎については治療のためにインターフェロンの投与が行われた患者データもあるため、予測するクラスは「Ｂ型肝炎」「Ｃ型肝炎インターフェロン投与無し」「Ｃ型肝炎インターフェロン投与有り」の3種類とした。 To further investigate the effectiveness of the proposed device, experiments were performed using actual medical data. The data used is the test data for chronic hepatitis provided by the Chiba University Hospital. This data is a record of tests for 771 people over the 20 years from 1982 to 2001. From this data, the type of chronic hepatitis is predicted using only time series data (PLT) of platelets. There are two types of chronic hepatitis, type B and type C, but there are data on patients who received interferon administration for treatment of hepatitis C, so the predicted class is "hepatitis B" There were three types: “no hepatitis C interferon administration” and “with hepatitis C interferon administration”.

比較のために、従来手法としてC4.5、サポートベクタマシン（以下SVM）、ニューラルネットワーク（以下、NN）を用いてスライドウィンドウによって手に入れた部分時系列データの各時間を各属性値として使うことで分類予測を行う。これらの従来手法をダイレクトアプローチ（以下、DA）として標記する。提案手法では特徴抽出でのクラスタ数を10、20、30、40、50の5種類で実験する。表２に予測精度の結果を示す。
For comparison, each time of partial time series data obtained by sliding window using C4.5, support vector machine (hereinafter SVM), neural network (NN) as conventional method is used as attribute value. The classification is predicted. These conventional methods are labeled as direct approach (DA). In the proposed method, the number of clusters in feature extraction is tested with 5, 20, 30, 40, and 50 types. Table 2 shows the results of prediction accuracy.

特徴パタン抽出だけを用いた場合、クラスタ数を40以上に設定することですべての既存手法よりも高い分類精度が得られた。しかし、GAによるパタン改良後の分類精度をみるとすべてのクラスタ数の設定において、すべての既存手法よりも高い分類精度を獲得できることが分かる。これらのことから、GAによるパタン改良機能は初期の分類精度が低い問題に対して有効に働くと考えられる。 When only feature pattern extraction was used, the classification accuracy was higher than all existing methods by setting the number of clusters to 40 or more. However, looking at the classification accuracy after pattern improvement by GA, it can be seen that higher classification accuracy than all existing methods can be obtained in all cluster number settings. From these facts, the pattern improvement function by GA is considered to work effectively for problems with low initial classification accuracy.

図８にGAによる世代ごとの分類精度の変化を示す。グラフで示す通り、5種類すべてのクラスタ数の設定が高い予測精度を得ていることがわかる。また、クラスタ数が低いほど最初の精度が悪いため、GAによる改良効果が高いと思われる。これらのことから、GAによる改良は、十分な特徴パタンの数を確保できずに精度が低くなる時ほど有効であり、最終的に安定して高い予測精度を出力する効果があるといえる。 FIG. 8 shows changes in classification accuracy by generation by GA. As shown in the graph, it can be seen that the setting of the number of all five types of clusters has high prediction accuracy. Also, since the initial accuracy is worse as the number of clusters is lower, the improvement effect by GA seems to be higher. From these facts, it can be said that the improvement by GA is more effective when the accuracy becomes lower without securing a sufficient number of feature patterns, and finally has the effect of stably outputting high prediction accuracy.

本発明で対象としている時系列データは経済、医療、科学など、現在様々な分野で頻出する基礎的なデータであるため、幅広い分野において適用可能な技術である。 Since the time-series data targeted by the present invention is basic data that frequently appears in various fields such as economy, medical care, and science, it is a technique that can be applied in a wide range of fields.

（ａ）ユークリッド距離のポイント対応例。
（ｂ） Dynamic Time Warpingのポイント対応例。
（ｃ）特徴パタンを基にしたトレーニングデータの例。
（ｄ）出力される決定木の例。
（ｅ）変化率に正規化した特徴パタン。
（ｆ）特徴パタンに対応する染色体。
（ｇ）選択された２つの親染色体。
（ｈ）交叉の結果作成された子染色体。
（ｉ）優先度と未来状態を組にした特徴パタン。
（ｊ）特徴パタンを基にして作成された理想的な実験データ。 (A) Euclidean distance point correspondence example.
(B) Dynamic Time Warping point correspondence example.
(C) Example of training data based on feature patterns.
(D) An example of an output decision tree.
(E) Feature pattern normalized to rate of change.
(F) Chromosome corresponding to the characteristic pattern.
(G) Two selected parental chromosomes.
(H) A child chromosome created as a result of crossover.
(I) A feature pattern that combines priority and future state.
(J) Ideal experimental data created based on characteristic patterns.

Claims

An apparatus for extracting, as a feature pattern, a change in a partial section of time-series data that characterizes the classification of each time-series data from a time-series database that can distinguish time-series data classified by a certain concept.

The apparatus according to claim 1, which has a function of extracting a classification rule for classifying unclassified time-series data into an appropriate concept group based on the feature pattern extracted in claim 1.

3. The apparatus according to claim 2, wherein the time series data that is actually unclassified is classified into an appropriate concept group in accordance with the classification rule extracted in claim 2.

The apparatus according to claim 2, which has a function of improving the classification accuracy of the classification rule extracted in claim 2 by automatically improving the feature pattern extracted in claim 1.