JPH02158871A

JPH02158871A - Document sorting device

Info

Publication number: JPH02158871A
Application number: JP63312107A
Authority: JP
Inventors: Tetsuya Morita; 哲也森田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-12-12
Filing date: 1988-12-12
Publication date: 1990-06-19

Abstract

PURPOSE: To classify documents by conceptual amount, without variation, by providing the device with a keyword information amount storage means, a conceptual feature extraction means and an inter-document distance calculation means. CONSTITUTION: The keyword information amount storage means 1 obtains the self-information amount of each keyword by a prescribed calculation based on the keyword appearance frequencies in a document database or the like; the conceptual feature extraction means 2 obtains the conceptual feature amount of each document from the self-information amounts; and the inter-document distance calculation means 3 classifies documents according to the difference between conceptual feature amounts. The documents are thus classified automatically from the keyword frequencies through the calculations of the respective means. Consequently, the conventional manual work can be omitted, and a document classification that is free of variation and based on conceptual amounts can be constructed.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to a document classification device, and more particularly to a document classification device that obtains a conceptual feature amount for a document from the keywords the document contains and classifies documents by that conceptual feature amount.

[Prior Art] A known method for automatically classifying documents into predetermined fields uses chi-square values to examine the bias of keywords. Such classification methods are described in (1) Tamura et al., "Automatic Document Classification by Statistical Methods" (Proceedings of the 36th National Conference on Information Processing, 1987) and (2) Chikio Hayashi, "Methods of Quantification" (Toyo Keizai Shimbunsha, 1974).

The chi-square test classifies documents by obtaining a chi-square value as an index of how strongly the appearance frequency of a keyword is biased by field. The chi-square value takes as the theoretical frequency the keyword appearance frequency that would be expected if the appearance frequency of each keyword and the total number of keywords in each field were independent events, and normalizes the difference between that theoretical frequency and the observed value.
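As a rough illustration of the prior-art calculation described above, the following sketch computes a chi-square value for a keyword-by-field table of counts. The function name `chi_square` and the table layout are assumptions made for this example, and the usual Pearson form (squared difference divided by the theoretical frequency) is assumed for the normalization, which the text does not spell out.

```python
def chi_square(observed):
    """Chi-square value for a keyword-by-field table of observed counts.

    observed[k][f] is how often keyword k occurs in documents of field f
    (a hypothetical layout chosen for this sketch).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for k, row in enumerate(observed):
        for f, count in enumerate(row):
            # Theoretical frequency if keyword and field were independent.
            expected = row_totals[k] * col_totals[f] / grand_total
            if expected > 0:
                # Assumed normalization: Pearson's (O - E)^2 / E.
                chi2 += (count - expected) ** 2 / expected
    return chi2


unbiased = chi_square([[10, 10], [10, 10]])  # keywords spread evenly over fields
biased = chi_square([[10, 0], [0, 10]])      # each keyword confined to one field
```

A table with no bias gives a value of zero, and the value grows as keywords concentrate in particular fields.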

Reference (1) above (Tamura et al.) describes a method that uses the chi-square test to automatically classify documents into predetermined fields. Because this method relies on the bias in keyword appearance frequency, a large amount of sample data must be classified by field beforehand, the chi-square values calculated, and the classification data prepared in advance.

Reference (2) (Hayashi) is likewise one of the statistical methods that use chi-square values, and is a method for examining correlations among multiple fields.

[Problems to be Solved by the Invention] With the methods described in references (1) and (2) above, classifying the sample data still requires manual work. Consequently, the variation and inappropriateness of manual classification intrude into the result.

In addition, the latter has the problem that the axes for classification are difficult to determine.

To solve the above problems, an object of the present invention is to provide a document classification device that obtains a conceptual feature amount for each document from the frequency values of the keywords contained in the document and classifies documents accordingly.

[Means for Solving the Problems] To achieve the above object, the present invention comprises: keyword information amount storage means that holds the self-information amount of each keyword, calculated using the appearance frequency value of the keyword in a document database; conceptual feature extraction means that obtains a conceptual feature amount for each document using the self-information amounts of the keywords; and inter-document distance calculation means that obtains the distance between documents according to the difference between their conceptual feature amounts.

The inter-document distance calculation means classifies documents according to the distance between documents.

[Operation] According to the present invention, the keyword information amount storage means obtains the self-information amount of each keyword by a predetermined calculation from the keyword appearance frequencies in a document database or the like; the conceptual feature extraction means obtains the conceptual feature amount of each document by a predetermined calculation from the self-information amounts; and the inter-document distance calculation means classifies documents according to the differences between conceptual feature amounts. Because documents are thus classified automatically from the keyword frequencies through the calculations of the respective means, the conventional manual work becomes unnecessary, and a document classification that is free of variation and based on conceptual amounts can be constructed.

[Embodiment] An embodiment of the present invention will be described concretely with reference to the drawing.

An embodiment of the document classification device according to the present invention is shown in the figure.

The keyword information amount storage unit 1 extracts keywords from an input unregistered document Q, obtains the appearance probability of each keyword from its appearance frequency as described later, and stores the logarithmic value of that probability (the negative logarithm, per equation (1)) as the keyword information amount I.

The conceptual feature extraction unit 2 receives the keyword information amounts I from the keyword information amount storage unit 1 and outputs their sum as the conceptual feature amount C(q) of document Q. The inter-document distance calculation unit 3 receives and stores the conceptual feature amount C(q) of each document from the conceptual feature extraction unit 2, obtains the conceptual distance between two documents, clusters (classifies) documents whose conceptual distance is small, and stores the resulting classifications in the document database 4. The functional units are connected by data buses a to c, which transfer the data generated by each unit.

In general, for keywords registered in a keyword collection such as a thesaurus, an appearance frequency can be defined for each keyword from the number of documents in which it appears, its total number of occurrences across all documents, or the like. Let $P_i$, the appearance frequency of keyword $\mathrm{KEY}_i$ normalized by the total number of keywords, be the appearance probability of $\mathrm{KEY}_i$. The system that associates each keyword with its appearance probability $P_i$ is then a complete event system and can be expressed as follows.

where $\sum_i P_i = 1$.

The self-information amount $I(\mathrm{KEY}_i)$ of $\mathrm{KEY}_i$ is then given by the following equation.

$$I(\mathrm{KEY}_i) = -\log P_i \qquad (1)$$

Since self-information is additive, the combined information amount of $\mathrm{KEY}_i$ and $\mathrm{KEY}_j$ is given by the following equation.

$$I(\mathrm{KEY}_i, \mathrm{KEY}_j) = I(\mathrm{KEY}_i) + I(\mathrm{KEY}_j) = -\log P_i - \log P_j \qquad (2)$$

The keyword information amount storage unit 1 receives a document Q not yet registered in the document database 4 from data bus a via the conceptual feature extraction unit 2, extracts each keyword of document Q, obtains its appearance probability $P_i$, calculates the self-information amount $I(\mathrm{KEY}_i)$ of the keyword by equation (1), and holds it. When a thesaurus is available, the keyword appearance probability can be obtained for each keyword classification item of the thesaurus and the self-information amount calculated by equation (1).
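The calculation of equations (1) and (2) can be sketched as follows. This is only an illustration: the function name `self_information`, the sample keywords and counts, and the use of the natural logarithm are assumptions, since the text does not fix the base of the logarithm or the exact frequency definition.

```python
import math
from collections import Counter


def self_information(keyword_counts):
    """Map each keyword to I = -log(P), where P is its count over the total count."""
    total = sum(keyword_counts.values())
    # Keywords with a zero count have no defined self-information and are skipped.
    return {kw: -math.log(n / total) for kw, n in keyword_counts.items() if n > 0}


# Hypothetical appearance frequencies of three keywords in a document database.
counts = Counter({"retrieval": 4, "thesaurus": 2, "chi-square": 2})
info = self_information(counts)

# Equation (2): the combined information amount of two keywords is the sum.
combined = info["retrieval"] + info["thesaurus"]
```

Rare keywords receive a larger self-information amount than frequent ones, which is what lets them weigh more heavily in the conceptual feature amount.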

Let q be the keyword set of a document Q and C(q) its conceptual feature amount. Then C(q) is given by

$$C(q) = \sum_{i \in q} I(\mathrm{KEY}_i) = -\sum_{i \in q} \log P_i \qquad (3)$$

In a thesaurus that already has classification items, the conceptual feature amount can also be treated as a vector. As the simplest example, consider an M-dimensional vector CV for a thesaurus with M classification items.

Now let r be the set of keywords belonging to the R-th classification item. The R-th element $CV_r(q)$ of the conceptual feature vector CV(q) of document Q is then

$$CV_r(q) = -\sum_{i \in q \,\cap\, i \in r} \log P_i \qquad (4)$$

where $i \in q \cap i \in r$ means that the sum is taken over the keywords i that are contained in document Q and are also contained in the R-th classification item.
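Equations (3) and (4) can be illustrated with the short sketch below. The function names, the `info` dictionary of precomputed self-information amounts and the two sample classification items are all hypothetical. Equation (4) is read here as the per-item sum of self-information amounts, in line with claim 2; that reading is an assumption where the equation itself is not legible.

```python
def concept_feature(doc_keywords, info):
    """Equation (3): sum of self-information over the document's keyword set q."""
    return sum(info[kw] for kw in set(doc_keywords))


def concept_feature_vector(doc_keywords, info, categories):
    """Equation (4): one component per classification item r, summing the
    self-information of keywords that are both in the document and in r."""
    q = set(doc_keywords)
    return [sum(info[kw] for kw in q & set(r)) for r in categories]


# Hypothetical self-information amounts and a thesaurus with M = 2 items.
info = {"a": 1.0, "b": 2.0, "c": 0.5}
categories = [{"a", "b"}, {"c"}]

c_q = concept_feature(["a", "c"], info)                      # scalar C(q)
cv_q = concept_feature_vector(["a", "c"], info, categories)  # vector CV(q)
```

When every keyword belongs to exactly one classification item, the components of CV(q) add up to C(q).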

The conceptual feature extraction unit 2 receives the self-information amount I of each keyword of document Q from the keyword information amount storage unit 1, calculates the conceptual feature amount C(q) or $CV_r(q)$ using equation (3) or equation (4), and outputs it to the inter-document distance calculation unit 3 via data bus b.

The conceptual information amount obtained by equation (3) is the sum of the keyword information amounts of a document and indicates only the magnitude of the self-information attached to that document. In this case the conceptual information amount represents how well the document can be separated (how easily it can be identified) when the document database is searched, and documents can also be classified by this degree of separation.

Usually, however, the intended use is to classify documents by content into existing classification items or the like. In that case the conceptual feature vector of equation (4) is used. In general, a database with M classification items can be regarded as forming an M-dimensional conceptual space. The concept held by a document in such a database can therefore be expressed as an M-dimensional vector consisting of M feature parameters. Moreover, since the distance between any two conceptual feature vectors can be calculated, the degree to which a document belongs to a given classification, the conceptual distance between two documents, and so on can be obtained.

For example, let INC(k, q) be the degree to which a document with conceptual feature vector CV(q) belongs to a classification K with keyword set k. It is given by

$$INC(k, q) = \frac{CV_k(q)}{\sum_{r=1}^{M} CV_r(q)} \qquad (5)$$
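A minimal sketch of equation (5) follows. The function name `belonging_degree`, the zero-based index `k` and the handling of an all-zero vector are assumptions made for this example; the text does not say what happens when a document has no scored keywords.

```python
def belonging_degree(cv, k):
    """Equation (5): share of the k-th component in the sum of all components."""
    total = sum(cv)
    if total == 0:
        # Assumption: a document with no scored keywords belongs to no classification.
        return 0.0
    return cv[k] / total


cv_q = [1.0, 3.0]  # hypothetical conceptual feature vector with M = 2
degrees = [belonging_degree(cv_q, k) for k in range(len(cv_q))]
best = max(range(len(cv_q)), key=lambda k: degrees[k])  # index of the best-fitting item
```

Choosing the classification with the largest degree is one plausible way to assign an unclassified document; the degrees over all classifications add up to 1 whenever the vector is non-zero.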

Likewise, let D(s, t) be the conceptual distance between two documents with conceptual feature vectors CV(s) and CV(t). Calculated, for example, as the city-block distance, it is given by

$$D(s, t) = \sum_{r=1}^{M} \left| CV_r(s) - CV_r(t) \right| \qquad (6)$$
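The city-block distance of equation (6) can be written as the short sketch below. The function name `concept_distance` and the sample vectors are hypothetical, and the length check is an addition for safety rather than something the text prescribes.

```python
def concept_distance(cv_s, cv_t):
    """Equation (6): city-block (L1) distance between two conceptual feature vectors."""
    if len(cv_s) != len(cv_t):
        raise ValueError("both vectors must have M components")
    return sum(abs(a - b) for a, b in zip(cv_s, cv_t))


d = concept_distance([1.0, 0.5], [0.0, 2.0])
```

The distance of a vector to itself is zero and the measure is symmetric, which the clustering procedure relies on.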

The inter-document distance calculation unit 3 receives the conceptual feature amount C(q) or CV(q). By performing the calculation of equation (5), it can determine the classification to which an unclassified document should belong; by using equation (6), it can form several classifications from groups of documents that are conceptually close. The inter-document distance calculation unit 3 enters the classification of document Q into the document database 4. Because a classification generated in this way is a composite concept that combines the concepts of several existing classification items, a new classification system oriented to the document concepts themselves, and not constrained by the existing classification items, is built up naturally.

The method of classifying similar documents using equation (6) is now described concretely.

As noted above, when documents are classified into existing classification items, the degree of belonging INC(k, q) to each classification K can be obtained using equation (5). Using the conceptual feature vectors further makes it possible to construct a new classification system from the existing classification items.

First, the conceptual distance D between every pair of documents to be classified is obtained. Next, one document (call it S) is selected arbitrarily from all the documents, and the documents whose conceptual distance to S is smaller than a predetermined threshold, that is, the documents conceptually close to S, are extracted. Letting s and t be the keyword sets of documents S and T respectively, the set of extracted documents T is expressed as

$$\{\, T \mid D(s, t) < \theta \,\}$$

(Since D(s, s) = 0 is clear from equation (6), document S is always included in this set.) Performing this operation for every document yields as many sets of similar documents as there are documents. These sets are arranged in descending order of their number of elements (number of documents), and as many sets as the required number of classifications are selected, largest first. The selection may be limited by the number of classifications or by the number of documents.
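The procedure above can be sketched as follows, under stated assumptions: documents are given as a dictionary from a hypothetical document identifier to its conceptual feature vector, the threshold `theta` is positive, and the selection is limited by the number of classifications (`max_classes`). The function name is invented for this example.

```python
def similar_document_sets(vectors, theta, max_classes):
    """Build one similar-document set per document and keep the largest ones.

    vectors maps a document id to its conceptual feature vector (hypothetical layout).
    """
    def distance(a, b):
        # City-block distance of equation (6).
        return sum(abs(x - y) for x, y in zip(a, b))

    sets = {}
    for s, cv_s in vectors.items():
        # All documents T with D(s, t) < theta; S itself qualifies because D(s, s) = 0.
        sets[s] = {t for t, cv_t in vectors.items() if distance(cv_s, cv_t) < theta}

    # Descending order of set size, then keep the required number of classifications.
    ranked = sorted(sets.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:max_classes]


vectors = {"d1": [1.0, 0.0], "d2": [1.2, 0.1], "d3": [0.0, 3.0]}
top = similar_document_sets(vectors, theta=0.5, max_classes=2)
```

In this sample, `d1` and `d2` each produce the same two-document set. The sketch keeps both, because the text does not say whether identical or overlapping sets should be merged; a practical implementation would probably need a rule for that.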

The maximum possible number of classifications equals the number of documents. In that case each classification contains 1 document, and there may well be cases where such a classification is optimal.

[Effects of the Invention] According to the present invention, the conceptual feature amount can be calculated using keyword extraction or the classification of an existing keyword collection, so there is no need to prepare evaluation data before classifying unregistered documents.

Because classifications are formed from groups of documents that are conceptually close, the invention has the excellent effect of naturally building up a new classification system that is oriented to the document concepts themselves and is not constrained by existing classification items.

[Brief explanation of the drawing]

The figure is a functional block diagram showing an embodiment of the document classification device of the present invention. Explanation of reference numerals for the main parts: 1: keyword information amount storage unit; 2: conceptual feature extraction unit; 3: inter-document distance calculation unit; 4: document database.

Claims

[Scope of Claims]
1. A document classification device comprising: keyword information amount storage means for holding the self-information amount of each keyword, calculated using the appearance frequency value of the keyword in a document database; conceptual feature extraction means for obtaining the conceptual feature amount of a document; and inter-document distance calculation means for obtaining the distance between documents according to the difference in conceptual feature amount between the documents; the device being characterized in that documents are classified based on the distance between the documents.
2. A document classification device comprising: keyword information amount storage means for holding the self-information amount of keywords, calculated using the appearance frequency value of keywords for each keyword classification item of a thesaurus used in a document database; conceptual feature extraction means for obtaining, as a conceptual feature amount, a vectorized sum of the keyword information amounts for each keyword classification item; and inter-document distance calculation means for obtaining the distance between documents according to the difference in conceptual feature amount between the documents; the device being characterized in that the inter-document distance calculation means classifies documents based on the distance between the documents.