JPH1153381A

JPH1153381A - Device and method for retrieving similar document

Info

Publication number: JPH1153381A
Application number: JP9208039A
Authority: JP
Inventors: Yasuo Tanosaki; 康雄田野崎; Naohide Kubota; 直秀久保田; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1997-08-01
Filing date: 1997-08-01
Publication date: 1999-02-26

Abstract

PROBLEM TO BE SOLVED: To retrieve similar documents while taking word appearance frequency into account, and to search a plurality of search target documents against a plurality of searched documents simultaneously with high accuracy and high efficiency, by calculating the similarity between document data on the basis of a word frequency table and a document norm table. SOLUTION: A main processing unit 11a takes one word frequency table each from a search target word frequency table storage buffer 11j and a searched word frequency table storage buffer 11k and passes processing to a word correspondence table creation unit 11d. When the unit 11d has created a word correspondence table and stored it in a word correspondence table storage buffer 11m, the unit 11a passes processing to a similarity calculation unit 11e. The unit 11e refers to the word correspondence table and the document norm table to calculate the similarity between the search target document and the searched document for each section of their word frequency tables. The similarity between the two documents is then calculated, for example by averaging the per-section similarities, and the result is output to an external storage device.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a search apparatus for electronic document data, and more particularly to a similar document search apparatus and a similar document search method for retrieving document data similar to given document data.

[0002]

[Prior Art] In recent years, large volumes of electronic document data have come into circulation, and systems that automatically retrieve documents similar to a designated document from a document database, for purposes such as automatic classification, have come into practical use. When searching for documents similar to a designated document (called the search target document), these systems calculate a similarity between the search target document and each other document (called a searched document) that reflects the number of word types common to both, and output the documents with large similarity values as the search result.

[0003] For example, as shown in FIG. 27, several words i1, i2, ..., in are designated in advance as primary search conditions, and for each of the m document data items b1, b2, ..., bm stored in the document database, the presence or absence of each primary search condition word i1, i2, ..., in in that document is checked. After a word presence/absence table has been created for each document in this way, some of the primary search condition words i1, i2, ..., in that are present in the search target document are designated as secondary search conditions. Documents containing more of these words are identified by referring to the word presence/absence table and are output as the similar document search result.
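
The conventional procedure of FIG. 27 can be sketched roughly as follows. This is a minimal illustration, not the prior-art system's actual code: the function names `build_presence_table` and `rank_by_presence`, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
def build_presence_table(documents, primary_words):
    """For each document, record whether each primary search word occurs."""
    table = {}
    for doc_id, text in documents.items():
        tokens = set(text.split())  # assumption: whitespace-separated words
        table[doc_id] = {w: (w in tokens) for w in primary_words}
    return table


def rank_by_presence(table, secondary_words):
    """Rank documents by how many secondary search words they contain."""
    scores = {
        doc_id: sum(1 for w in secondary_words if row[w])
        for doc_id, row in table.items()
    }
    # Higher count first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data.
documents = {
    "b1": "search index search query",
    "b2": "index query table norm vector search",
    "b3": "table vector",
}
primary = ["search", "index", "query", "table", "norm"]
table = build_presence_table(documents, primary)
ranking = rank_by_presence(table, ["search", "index", "table"])
```

Only presence counts in this scheme: `b1` mentions "search" twice, but the repetition has no effect on its score.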

[0004]

[Problems to Be Solved by the Invention] However, because such a conventional similar document search method performs the similarity search using only the presence or absence of the search condition words in the searched documents as its criterion, it does not take into account the weight that each search condition word carries in a searched document. As a result, long documents that contain more word types tend to be selected as similar document candidates, and accuracy problems have been pointed out. In addition, every time a similar document search is performed with the conventional method, words that are present in the search target document and characterize it well must be correctly designated as secondary search conditions, and all other words must be excluded from the secondary search conditions. Selecting the secondary search conditions therefore requires manual effort, for example an operator consulting the search target document to find key words.

[0005] The present invention has been made to solve these problems, and its object is to provide a similar document search apparatus and a similar document search method that offer high search accuracy and are capable of high-speed processing.

[0006]

[Means for Solving the Problems] To achieve the above object, the similar document search apparatus of the invention according to claim 1 comprises: storage means for storing a plurality of document data items; word frequency table creation means for creating, for each document data item stored in the storage means, a word frequency table by obtaining the appearance frequency of each preset search condition word; document norm table creation means for creating, for each document data item stored in the storage means, a document norm table by calculating the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the word frequency table created by the word frequency table creation means; designation means for designating search target and searched document data from among the document data stored in the storage means; and similarity calculation means for calculating the similarity between the document data items designated by the designation means on the basis of the word frequency table and the document norm table.
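
The text at this point says only that the similarity is calculated from the word frequency table and the document norm table. One formula consistent with that description is the cosine of the angle between the two frequency vectors, that is, their inner product divided by the product of their norms. This is an assumption offered for illustration; the symbols $f_a$, $f_b$, and $N$ (the number of search condition words) do not come from the source.

```latex
\[
\mathrm{sim}(a, b)
  = \frac{\sum_{i=1}^{N} f_a(i)\, f_b(i)}{\lVert f_a \rVert \, \lVert f_b \rVert},
\qquad
\lVert f_d \rVert = \sqrt{\sum_{i=1}^{N} f_d(i)^2}
\]
```

Under this reading, a long document with many high frequencies also has a large norm, so length alone does not raise its score.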

[0007] According to the present invention, similar document search that takes into account the appearance frequency of the search condition words in the document data becomes possible, and similar document search over a plurality of search target documents and a plurality of searched documents at the same time can be performed with high efficiency and high accuracy.

[0008] In the invention of claim 1, a word frequency table and a document norm table are calculated and stored in advance for each document data item stored in the storage means, and once the search target and searched document data are designated, the similarity calculation means obtains the similarity between the document data from the word frequency tables and document norm tables of the relevant document data. Consequently, even when similar document searches over a plurality of search target documents and a plurality of searched documents are performed in succession, that is, when the similarity between one search target document and each searched document is calculated and then the similarity between the next search target document and each searched document is calculated, the word frequency table creation and the norm calculation for each searched document need to be performed only once, which enables high-speed processing.
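
The benefit of precomputing both tables can be sketched as follows. This is a minimal illustration under stated assumptions: the names `SEARCH_WORDS`, `build_tables`, and `similarity` are hypothetical, words are split on whitespace, and cosine similarity is assumed as the measure because the source does not give the formula here.

```python
import math
from collections import Counter

SEARCH_WORDS = ["search", "index", "table"]  # hypothetical search condition words


def build_tables(documents):
    """Precompute a frequency vector and its norm for every stored document."""
    freq_table, norm_table = {}, {}
    for doc_id, text in documents.items():
        counts = Counter(text.split())
        vec = [counts[w] for w in SEARCH_WORDS]
        freq_table[doc_id] = vec
        norm_table[doc_id] = math.sqrt(sum(v * v for v in vec))
    return freq_table, norm_table


def similarity(a, b, freq_table, norm_table):
    """Cosine similarity looked up from the precomputed tables (assumed measure)."""
    denom = norm_table[a] * norm_table[b]
    if denom == 0.0:
        return 0.0
    dot = sum(x * y for x, y in zip(freq_table[a], freq_table[b]))
    return dot / denom


# Hypothetical sample data: a1, a2 are search targets; b1, b2 are searched documents.
documents = {
    "a1": "search search index",
    "a2": "table table",
    "b1": "search index",
    "b2": "table index",
}
freq_table, norm_table = build_tables(documents)  # done once, before any query

# Successive queries reuse the same tables; nothing is recomputed per query.
results = {
    q: {c: similarity(q, c, freq_table, norm_table) for c in ("b1", "b2")}
    for q in ("a1", "a2")
}
```

Each similarity call touches only table lookups and one dot product, so adding a further search target document costs no additional counting or norm calculation for `b1` and `b2`.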

[0009] Further, the similar document search apparatus of the invention according to claim 2 comprises: storage means for storing a plurality of document data items; word frequency table creation means for creating, for each document data item stored in the storage means, a word frequency table by obtaining the appearance frequency of each preset search condition word; document norm table creation means for creating, for each document data item stored in the storage means, a document norm table by calculating the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the word frequency table created by the word frequency table creation means; designation means for designating search target and searched document data from among the document data stored in the storage means; word correspondence table creation means for creating a word correspondence table indicating the registered positional relationship of common words between the word frequency tables created by the word frequency table creation means for the document data items designated by the designation means; and similarity calculation means for calculating the similarity between the document data items designated by the designation means on the basis of the word frequency table, the document norm table, and the word correspondence table. In this invention, when the similarity between document data items is calculated, the appearance frequency information for the words common to the word frequency table of the search target document and the word frequency table of the searched document, which is needed for the inner product calculation of the vector data of the word frequency tables, is obtained directly by referring to the word correspondence table, so processing can be sped up further.

To achieve the above object, the similar document search method of the present invention, as set forth in claim 3, comprises the steps of: creating, for each document data item stored in a document database, a word frequency table by obtaining the appearance frequency of each preset search condition word; creating, for each document data item stored in the document database, a document norm table by calculating the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the created word frequency table; designating search target and searched document data from among the document data stored in the document database; and calculating the similarity between the designated document data items on the basis of the word frequency table and the document norm table. The operation and effect of this invention are equivalent to those of the invention of claim 1.

[0010] Further, the similar document search method of the present invention, as set forth in claim 4, comprises the steps of: creating, for each document data item stored in a document database, a word frequency table by obtaining the appearance frequency of each preset search condition word; creating, for each document data item stored in the document database, a document norm table by calculating the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the created word frequency table; designating search target and searched document data from among the document data stored in the document database; creating a word correspondence table indicating the registered positional relationship of common words between the word frequency tables created for the designated document data items; and calculating the similarity between the designated document data items on the basis of the word frequency table, the document norm table, and the word correspondence table. The operation and effect of this invention are equivalent to those of the invention of claim 2.

[0011]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0012] FIG. 1 is a diagram showing the hardware configuration of a similar document search apparatus according to one embodiment of the present invention.

[0013] As shown in FIG. 1, the similar document search apparatus comprises an input device 1, a display device 2, a control device 3, a memory 4, an external storage device 5, and a communication device 6, which are connected to one another via a bus.

[0014] The input device 1 comprises a keyboard, mouse, tablet, touch panel, or the like, and is used to enter character strings and to input various data and commands to the apparatus. The display device 2 comprises a CRT or liquid crystal display and a display controller, and displays search results and instructions from the system to the user. The control device 3 comprises a CPU and performs processing such as controlling each device and transferring data between devices.

[0015] The memory 4 comprises RAM or the like and, as shown in FIGS. 2 and 3, consists of a program section that stores the programs with which the control device 3 executes various control and processing operations, and a buffer section that stores the data needed during processing.

[0016] As shown in FIG. 2, the program section includes a word extraction unit 10a for word extraction processing. As shown in FIG. 3, it also includes a main processing unit 11a that controls the overall processing and, as subroutines called by the main processing unit 11a, an initialization unit 11b, a norm calculation unit 11c, a word correspondence table creation unit 11d, a similarity calculation unit 11e, an output editing unit 11f, a document list display unit 11g, a document selection unit 11h, and a document content display unit 11i.

[0017] As shown in FIG. 2, the buffer section includes, as work areas for the word extraction unit 10a, an input document data storage buffer 10b, a word ID-word table storage buffer 10c, a word frequency table storage buffer 10d, and a word extraction table storage buffer 10e. As work areas for the main processing unit 11a, as shown in FIG. 3, it includes a search target word frequency table storage buffer 11j, a searched word frequency table storage buffer 11k, a document norm table storage buffer 11l, a word correspondence table storage buffer 11m, a similarity calculation result storage buffer 11n, and an output editing result storage buffer 11o.

[0018] The input document data storage buffer 10b stores document data such as that shown in FIG. 4, in which A, B, and C are individual sections of one document. The word ID-word table storage buffer 10c stores word ID-word table data in a format that shows the correspondence between word IDs and words, as shown in FIG. 5. The word frequency table storage buffer 10d stores a word frequency table in the format shown in FIG. 6 or FIG. 7. The search target word frequency table storage buffer 11j stores the word frequency tables of search target documents in the format shown in FIG. 6. The searched word frequency table storage buffer 11k stores the word frequency tables of searched documents in the format shown in FIG. 7. The document norm table storage buffer 11l stores a document norm table in the format shown in FIG. 8. The word correspondence table storage buffer 11m stores a word correspondence table in the format shown in FIG. 9. The similarity calculation result storage buffer 11n stores similarity calculation results in the format shown in FIG. 10. The output editing result storage buffer 11o stores the edited output of the similarity calculation results in the format shown in FIG. 11.

[0019] The buffer section also reserves areas 10f and 11p for work variables. These areas hold variables used in the processing described later, such as the total number dat of search target word frequency tables, the total number dbt of searched word frequency tables, the total number of documents dt, the ID da of the search target word frequency table currently being referenced, and the ID db of the searched word frequency table currently being referenced.

[0020] The external storage device 5 comprises a hard disk, flash memory, or magneto-optical disk together with a controller, and stores document data, word frequency tables, the word extraction table, and so on. FIG. 12 shows the storage format of the document data and word frequency tables stored in the external storage device 5, and FIG. 13 shows the storage format of the word extraction table. As shown in FIG. 12, each document data item consists of a header part, which stores attribute data such as the document title, file creation date and time, and author, and a text data part, which holds the character code data making up the document; these are stored in ID order. The word extraction table is referenced to extract words when a word frequency table is created from text data. The word extraction table can be set as desired, for example by an operator.

[0021] The word ID-word table shown in FIG. 5 is created on the basis of the word extraction table of FIG. 13 and lists the actual word corresponding to each word ID. The word ID-word table is used to convert between word IDs and words.
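
A two-way mapping of this kind might look like the sketch below. The function name `build_word_id_table`, the sample words, and the choice of sequential IDs starting at 0 are assumptions for illustration; the source does not say how word IDs are assigned.

```python
def build_word_id_table(word_extraction_table):
    """Assign sequential word IDs (assumed to start at 0) to the listed words."""
    id_to_word = dict(enumerate(word_extraction_table))
    word_to_id = {word: wid for wid, word in id_to_word.items()}
    return id_to_word, word_to_id


# Hypothetical operator-defined word extraction table.
extraction_table = ["document", "search", "similarity"]
id_to_word, word_to_id = build_word_id_table(extraction_table)
```

`id_to_word` converts a word ID back to its word for display, and `word_to_id` converts an extracted word to the ID stored in the frequency tables.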

[0022] The word frequency table shown in FIG. 6 is created from individual document data by the word extraction processing. For each section, which is a structural element of the document, the table records the appearance frequency of the word corresponding to each word ID listed in the word ID-word table (that is, each word listed in the word extraction table). The contents of the word frequency table shown in FIG. 6 are an example of applying word extraction processing to the document data shown in FIG. 4. Word appearance frequency information is recorded in the word frequency table for each of sections A to Z of the document data.
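
A per-section frequency table can be sketched as follows. The function name `build_word_frequency_table`, the lowercase whitespace tokenization, and the sparse layout (only words that occur are stored) are assumptions for illustration, not details confirmed by the source.

```python
from collections import Counter


def build_word_frequency_table(sections, word_to_id):
    """Return {section: {word_id: count}} for the words in word_to_id.

    Only words that actually occur are stored (a sparse layout, assumed here).
    """
    table = {}
    for name, text in sections.items():
        counts = Counter(text.lower().split())
        table[name] = {
            wid: counts[word]
            for word, wid in word_to_id.items()
            if counts[word] > 0
        }
    return table


# Hypothetical word IDs and document sections.
word_to_id = {"document": 0, "search": 1, "similarity": 2}
sections = {
    "A": "search the document then search again",
    "B": "similarity of one document to another document",
    "C": "nothing relevant here",
}
freq_table = build_word_frequency_table(sections, word_to_id)
```

Section C contains none of the listed words, so its entry is empty.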

[0023] The document norm table shown in FIG. 8 records, for each word frequency table, the vector magnitude (norm) of each section (A to Z).
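
The norm table can be sketched as below, assuming the "vector magnitude" is the Euclidean norm of each section's frequency vector. The function name `build_document_norm_table` and the nested-dictionary layout are hypothetical.

```python
import math


def build_document_norm_table(freq_tables):
    """Return {doc_id: {section: Euclidean norm of its frequency vector}}."""
    return {
        doc_id: {
            section: math.sqrt(sum(c * c for c in counts.values()))
            for section, counts in sections.items()
        }
        for doc_id, sections in freq_tables.items()
    }


# Hypothetical frequency tables: {doc_id: {section: {word_id: count}}}.
freq_tables = {
    "a1": {"A": {0: 3, 1: 4}, "B": {2: 2}},
    "b1": {"A": {1: 1, 2: 2, 3: 2}},
}
norm_table = build_document_norm_table(freq_tables)
```

For section A of `a1`, the counts 3 and 4 give a norm of 5.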

[0024] The word correspondence table shown in FIG. 9 is created temporarily when the similarity calculation processing is performed. It records the word IDs listed in the word frequency table of the search target document shown in FIG. 6, together with information that associates each of these word IDs with the same word ID in the word frequency table of the searched document shown in FIG. 7. Specifically, for a word ID present in both the search target document and the searched document, the association information is the offset position of that word ID in the word frequency table of the searched document. For a word ID that is not present in the searched document, a value indicating this (for example, -1) is recorded. The word correspondence table is used so that the inner product needed for calculating the similarity between document data can be obtained with the minimum necessary computation.
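
The offset-or-minus-one scheme described above can be sketched as follows. The parallel-array layout (one list of word IDs and one list of counts per section) and the function names `build_correspondence_table` and `inner_product` are assumptions made for the example.

```python
def build_correspondence_table(query_ids, candidate_ids):
    """For each query word ID, give its offset in candidate_ids, or -1 if absent."""
    offset_of = {wid: pos for pos, wid in enumerate(candidate_ids)}
    return [offset_of.get(wid, -1) for wid in query_ids]


def inner_product(query_counts, candidate_counts, correspondence):
    """Sum products only over word IDs present in both tables."""
    total = 0
    for q_pos, c_pos in enumerate(correspondence):
        if c_pos != -1:  # -1 marks a word missing from the searched document
            total += query_counts[q_pos] * candidate_counts[c_pos]
    return total


# Hypothetical data: word IDs and their counts for one section of each document.
query_ids, query_counts = [2, 5, 7, 9], [1, 3, 2, 4]
candidate_ids, candidate_counts = [5, 6, 9], [2, 1, 5]

correspondence = build_correspondence_table(query_ids, candidate_ids)
dot = inner_product(query_counts, candidate_counts, correspondence)
```

Here word IDs 5 and 9 are shared, so only two multiplications contribute to the inner product; the other entries are skipped without any search through the searched document's table.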

[0025] The similarity calculation result shown in FIG. 10 is a table created by the similarity calculation processing; it records the similarity of each searched document to one search target document. When all the searched documents are divided into r blocks, r similarity calculation results (2 in the example of FIG. 10) are created for each search target document. That is, suppose that in FIG. 14(a) n search target documents a1 to an and n searched documents b1 to bn are given. With n = 10 and r = 2 divisions, each block contains 5 documents. In this example, as shown in FIG. 14(c), two similarity calculation results are obtained for search target document a1: o11 (the similarities for b1 to b5) and o12 (the similarities for bn-4 to bn).
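
The block arithmetic in this example (n = 10, r = 2) can be checked with a short sketch. The function name `split_into_blocks` and the requirement that the documents divide evenly are assumptions; the source does not say how an uneven remainder would be handled.

```python
def split_into_blocks(doc_ids, r):
    """Split doc_ids into r consecutive blocks of equal size."""
    if len(doc_ids) % r != 0:
        raise ValueError("this sketch assumes the documents divide evenly into r blocks")
    size = len(doc_ids) // r
    return [doc_ids[i * size:(i + 1) * size] for i in range(r)]


n, r = 10, 2
searched = [f"b{j}" for j in range(1, n + 1)]
blocks = split_into_blocks(searched, r)

# Label each result for search target a1 as o1k, where k is the block number.
result_labels = {f"o1{k}": block for k, block in enumerate(blocks, start=1)}
```

With these values `o11` covers b1 to b5 and `o12` covers b6 to b10, which corresponds to bn-4 to bn for n = 10.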

[0026] The communication device 6 exchanges data with the outside via a communication line and comprises, for example, a LAN line and a LAN controller.

[0027] The operation of this embodiment is described below.

[0028] As shown in FIG. 12, the external storage device 5 is assumed to store n document data items with document IDs 0 to n, together with a word frequency table corresponding to each document data item.

[0029] In this description of the operation, the overall processing of the apparatus is explained first, followed by the details of each important process.

[0030] Program processing in this apparatus consists of two parts: word extraction processing and main processing. Word extraction processing creates word frequency tables from document data. Main processing searches for similar documents using the word frequency tables created by the word extraction processing. Word extraction processing is therefore performed before main processing.

[0031] FIG. 15 is a flowchart showing the overall procedure of the word extraction processing.

[0032] Immediately after starting, the word extraction unit 10a clears each of the buffers 10b to 10f in the buffer section. The word extraction unit 10a also stores the word extraction table (FIG. 13) held in the external storage device 5 in the word extraction table storage buffer 10d, and the word ID-word table (FIG. 5) in the word ID-word table storage buffer 10c (step S1). In doing so, the word extraction unit 10a creates the word ID-word table by referring to the word extraction table (FIG. 13) and stores it in the word ID-word table storage buffer 10c.

[0033] Next, the word extraction unit 10a reads the document data stored in the external storage device 5, creates the word frequency table (FIGS. 6 and 7) corresponding to this document data, and stores it in the external storage device 5 (step S2). When word frequency tables have been created and stored for all the document data, the word extraction processing ends (step S3).

[0034] Next, the operation of the main processing is described with reference to FIG. 16.

[0035] When the main processing unit 11a is started, the initialization unit 11b first performs the various initialization operations needed for similar document search, such as clearing the buffers 11j to 11p and setting the numbers of word frequency tables of the search target documents and searched documents in variables (step S4).

[0036] The norm calculation unit 11c is then started. For each section of the search target documents and searched documents stored in the word frequency tables, the norm calculation unit 11c calculates the norm (the vector magnitude) of the one-dimensional vector whose elements are the word appearance frequencies for each word ID, creates the document norm table in the document norm table storage buffer 11l, and outputs it to the external storage device 5 (step S5).

[0037] Once the document norm table has been created, the main processing unit 11a reads the word frequency tables (FIG. 6) of the number of search target documents specified in initialization step S4 from the external storage device 5 and stores them in the search target word frequency table storage buffer 11j (step S6). The main processing unit 11a then reads the word frequency tables (FIG. 7) of the specified number of searched documents from the external storage device 5 and stores them in the searched word frequency table storage buffer 11k (step S7).

[0038] The main processing unit 11a then takes one word frequency table each from the search target word frequency table storage buffer 11j and the searched word frequency table storage buffer 11k and passes processing to the word correspondence table creation unit 11d. When the word correspondence table creation unit 11d has created the word correspondence table (FIG. 9) and stored it in the word correspondence table storage buffer 11m (step S8), the main processing unit 11a passes processing to the similarity calculation unit 11e. Referring to the word correspondence table and the document norm table, the similarity calculation unit 11e calculates the similarity between the search target document and the searched document for each section of their word frequency tables. It then calculates the similarity between the documents, for example by taking the average of the per-section similarities, and outputs this similarity calculation result to the external storage device 5 (step S9).
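
Step S9 can be sketched as follows, with several assumptions stated plainly: cosine similarity is used as the per-section measure, sections are paired by name (A with A, B with B), and the norms are recomputed inline to keep the example self-contained, whereas the apparatus reads them from the document norm table. The function names `section_similarity` and `document_similarity` are hypothetical.

```python
import math


def section_similarity(q_counts, c_counts):
    """Cosine similarity of two sparse {word_id: count} vectors (assumed measure)."""
    q_norm = math.sqrt(sum(v * v for v in q_counts.values()))
    c_norm = math.sqrt(sum(v * v for v in c_counts.values()))
    if q_norm == 0.0 or c_norm == 0.0:
        return 0.0
    dot = sum(v * c_counts[w] for w, v in q_counts.items() if w in c_counts)
    return dot / (q_norm * c_norm)


def document_similarity(query, candidate):
    """Average the similarities of same-named sections (pairing is an assumption)."""
    shared = sorted(set(query) & set(candidate))
    if not shared:
        return 0.0
    return sum(section_similarity(query[s], candidate[s]) for s in shared) / len(shared)


# Hypothetical per-section frequency tables for one pair of documents.
query = {"A": {0: 1, 1: 1}, "B": {2: 3}}
candidate = {"A": {0: 2, 1: 2}, "B": {3: 5}}
score = document_similarity(query, candidate)
```

Section A of both documents points in the same direction (similarity close to 1), section B shares no words (similarity 0), so the averaged document similarity is about 0.5.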

[0039] Steps S8 and S9 are repeated until, for the word frequency tables of all (the specified number of) search target documents stored in the search target word frequency table storage buffer 11j, the similarity to the word frequency tables of all (the specified number of) searched documents stored in the searched word frequency table storage buffer 11k has been calculated (step S10).

【００４０】Thereafter, the word frequency tables of the next specified number of searched documents are read from the external storage device 5, and the searched-document word frequency table storage buffer 11k is overwritten with them. Steps S7 to S10 are then repeated until, as before, the similarity between the word frequency tables of all (the specified number of) search target documents stored in the search-target word frequency table storage buffer 11j and the word frequency tables of all (the specified number of) searched documents stored in the buffer 11k has been calculated (step S11).

【００４１】After that, the word frequency tables of the next specified number of search target documents are read from the external storage device 5, the search-target word frequency table storage buffer 11j is overwritten with them, and the similarity calculation is performed in the same way (step S12).
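The batching of steps S6 to S12 can be pictured as two nested loops over fixed-size groups of word frequency tables. The following Python sketch is illustrative only: the names `batches` and `all_pairs_in_batches`, the `similarity` callback, and the use of in-memory ID lists in place of the buffers 11j and 11k are assumptions, not part of the embodiment.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batches(ids: Sequence[str], sep: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `sep` document IDs."""
    for start in range(0, len(ids), sep):
        yield list(ids[start:start + sep])


def all_pairs_in_batches(
    a_ids: Sequence[str],
    b_ids: Sequence[str],
    sep: int,
    similarity: Callable[[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Compare every search target document with every searched document,
    holding at most `sep` tables of each kind at a time."""
    results: Dict[Tuple[str, str], float] = {}
    for a_batch in batches(a_ids, sep):        # fill buffer 11j (S6, S12)
        for b_batch in batches(b_ids, sep):    # fill buffer 11k (S7, S11)
            for da in a_batch:                 # pair loop closed by S10
                for db in b_batch:
                    results[(da, db)] = similarity(da, db)  # S8 and S9
    return results
```

With `sep` set to 5, as in the embodiment, each searched-document batch is reloaded once per search-target batch, which matches the reloading of b1 to b5 described for FIG. 14(d).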

【００４２】Next, the processing from step S6 to step S12 will be described concretely with reference to FIG. 14.

【００４３】First, the range enclosed by the bold frame in FIG. 14(a), that is, the word frequency tables a1 to a5 of a plurality of search target documents and the word frequency tables b1 to b5 of a plurality of searched documents, is written to memory in one batch. The similarity between the word frequency table a1 of the first search target document and each of the word frequency tables b1 to b5 of the searched documents is calculated, and a similarity calculation result o11 such as that shown in FIG. 10 is output. Subsequently, as shown in FIG. 14(b), the similarity between the word frequency table a2 of the next search target document and each of the word frequency tables b1 to b5 is calculated, and a similarity calculation result o21 is output. Similarly, the similarities between the word frequency tables a3, a4, and a5 of the search target documents and each of the word frequency tables b1 to b5 are calculated, and similarity calculation results o31, o41, and o51 are output.

【００４４】Next, as shown in FIG. 14(c), the memory area reserved for the word frequency tables b1 to b5 of the searched documents is released, and the word frequency tables bn-4 to bn of searched documents are written there. Then, as before, the similarity between each of the word frequency tables a1 to a5 of the search target documents and each of the word frequency tables bn-4 to bn of the searched documents is calculated, and a similarity calculation result o12 is output.

【００４５】Thereafter, as shown in FIG. 14(d), the memory area reserved for the word frequency tables a1 to a5 of the search target documents is released, and the word frequency tables an-4 to an of the next specified number of search target documents are written there. At the same time, the memory area reserved for the word frequency tables bn-4 to bn of the searched documents is released, and the word frequency tables b1 to b5 are written there again. As before, the similarity between each of the word frequency tables an-4 to an and each of the word frequency tables b1 to b5 is calculated. The memory area of the word frequency tables b1 to b5 is then released, the word frequency tables bn-4 to bn of the searched documents are written there, and the similarity is calculated in the same way.

【００４６】When the similarities between the word frequency tables of all search target documents and the word frequency tables of all searched documents in the external storage device 5 have been calculated in this way and the similarity calculation results have been stored in the external storage device 5, the main processing unit 11a transfers control to the output editing unit 11f.

【００４７】The output editing unit 11f sequentially reads the similarity calculation results stored in the external storage device 5, which give the similarity between each individual search target document and all searched documents, sorts them, for example by placing those with higher similarity first, and stores the result in the external storage device 5 as an output editing result such as that shown in FIG. 11 (step S13).

【００４８】Once the output editing results for all search target documents have been stored in the external storage device 5, the main processing unit 11a outputs a document list screen such as that shown in FIG. 18 to the display device 2 via the document list display unit 11g (step S14) and then starts the document selection unit 11h (step S15). The document list screen has an input field in which the user enters the ID of a search target document. When a search target document ID (for example, a1) is entered in this field through the document selection unit 11h, a list of the IDs of the searched documents found for it is displayed together with their similarities.

【００４９】When the check box corresponding to a searched document ID is selected with a pointing device such as a mouse, the document content display unit 11i is started, and the similarity and content of the selected document are displayed as shown in FIG. 19 (step S16).

【００５０】The display of these similarity search results ends when the close box at the upper left of the screen shown in FIG. 18 or FIG. 19 is selected with a pointing device or the like (step S17).

【００５１】Next, the word extraction processing of step S2, the initialization processing of step S4, the norm calculation processing of step S5, the word correspondence table creation processing of step S8, the similarity calculation processing of step S9, and the output editing processing of step S13 will be described in detail.

【００５２】First, the word extraction processing of step S2 will be described. FIG. 20 is a detailed flowchart of the word extraction processing of step S2.

【００５３】With document data stored in the external storage device 5, the word extraction unit 10a is started once the environment, such as the number of documents, has been set in the initialization processing of step S1. The word extraction unit 10a stores the specified input document data in the input document data storage buffer 10b (step S21). Next, the word extraction unit 10a refers to the word extraction table, extracts from the input document data the words contained in the word extraction table, and stores them in a work area (step S22).

【００５４】Next, the word extraction unit 10a refers to the word ID-word table. If an extracted word is already registered in the word ID-word table, control moves to step S25; if it is not yet registered, control moves to step S24 (step S23). In step S24, the unregistered word is added to the word ID-word table and a new word ID is issued for it (step S24). The words stored in step S22 are then replaced with word IDs on the basis of the word ID-word table (step S25).

【００５５】Next, the word extraction unit 10a re-tallies the words stored in the work area, determines for each section of the document data the occurrence frequency of each word (each word that also appears in the word extraction table), and writes the result to the word frequency table storage buffer 11m (step S26). When the occurrence frequencies have been calculated and written for all sections of the document data, processing proceeds to step S28 (step S27). Finally, the word extraction unit 10a outputs the contents written to the word frequency table storage buffer 11m to the external storage device 5 as a word frequency table (step S28).
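The word extraction of steps S22 to S26 can be sketched roughly as follows. This is a minimal Python illustration under several assumptions that are not in the embodiment: sections are plain strings, words are found by non-overlapping substring counting, word IDs are consecutive integers starting at 0, only words missing from the word ID-word table receive a new ID, and the function name `build_word_frequency_table` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_word_frequency_table(
    sections: List[str],
    extraction_words: List[str],
    word_ids: Dict[str, int],
) -> List[Dict[int, int]]:
    """Return one {word ID: frequency} mapping per section.

    `word_ids` plays the role of the word ID-word table and is updated
    in place when a word is seen for the first time.
    """
    table: List[Dict[int, int]] = []
    for text in sections:
        counts: Counter = Counter()
        for word in extraction_words:
            n = text.count(word)          # S22: only listed words are extracted
            if n == 0:
                continue
            if word not in word_ids:      # S23: is the word registered?
                word_ids[word] = len(word_ids)   # S24: issue a new word ID
            counts[word_ids[word]] += n   # S25, S26: tally by word ID
        table.append(dict(counts))
    return table
```

Keeping the word ID-word table outside the function lets the same IDs be reused across documents, which is what makes the frequency tables of different documents comparable later.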

【００５６】Next, the initialization processing of step S4 will be described. FIG. 21 is a flowchart showing the details of this initialization processing.

【００５７】When the word frequency tables have been stored in the external storage device 5, the main processing unit 11a transfers control to the initialization unit 11b. The initialization unit 11b prepares the following work variables in the work variable area 11p: the number of documents read at one time when word frequency tables are stored in the word frequency table storage buffers 11j and 11k for the search target documents and the searched documents (sep), the total number of search-target word frequency tables (dat), the total number of searched-document word frequency tables (dbt), the total number of word frequency tables (dt), the ID of the search-target word frequency table to be used in the calculation (da), and the ID of the searched-document word frequency table to be used in the calculation (db). Initial values are then assigned to these variables: in the present embodiment, 5 is assigned to sep and 0 to all the other variables (step S41).

【００５８】Thereafter, the initialization unit 11b counts the word frequency tables of search target documents in the external storage device 5 and assigns the count to dat (step S42). It likewise counts the word frequency tables of searched documents in the external storage device 5 and assigns the count to dbt (step S43). Finally, the initialization unit 11b assigns the sum of dat and dbt to the total number of word frequency tables dt (step S44).

【００５９】Next, the norm calculation processing of step S5 will be described. FIG. 22 is a flowchart showing the details of this norm calculation processing.

【００６０】The norm of a one-dimensional matrix A = [a1, a2, ..., an] is the value given by

【数１】(Equation 1)
$$\|A\| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$$

In the present embodiment, one item of document data is divided into a plurality of sections, so p norms, equal in number to the sections, are calculated for each document ID in the document norm table.
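As a small worked illustration, assume the norm is the usual Euclidean norm, which is consistent with the square, accumulate, and square-root procedure of steps S506 and S509. For a section whose word occurrence frequencies are the made-up values 3, 0, and 4:

$$\|A\| = \sqrt{3^2 + 0^2 + 4^2} = \sqrt{25} = 5$$

A document with two sections would therefore contribute two such values to its row of the document norm table.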

【００６１】The norm calculation processing is performed as follows. When the initialization unit 11b has finished, control returns to the main processing unit 11a, which transfers control to the norm calculation unit 11c. The norm calculation unit 11c prepares the following work variables in the work variable area 11p: the section being referenced (p), the total number of sections (pt), the word ID being referenced (q), the total number of words (qt), the document being referenced (d), the word occurrence frequency being referenced (f), a work value (w), and the norm (norm). The initial value 0 is assigned to each of these variables (step S501).

【００６２】Next, the norm calculation unit 11c reads the document norm from the external storage device 5 and stores it in the document norm table storage buffer 11l (step S502). Subsequently, the norm calculation unit 11c writes the d-th word frequency table stored in the external storage device 5 to the search-target word frequency table storage buffer 11j (step S503). It then reads the total number of sections and the total number of words from the buffer 11j and assigns them to the variables pt and qt, respectively (step S504).

【００６３】Subsequently, the norm calculation unit 11c looks up the word occurrence frequency for the specified section p and word ID q and assigns it to f (step S505). To obtain the norm of section p, the square of the occurrence frequency f of word ID q is accumulated into the work variable w (step S506). The value of q is then incremented by 1 to advance to the next word ID (step S507). Steps S505 to S507 are repeated until all the words of the specified section have been referenced (step S508).

【００６４】Next, the norm calculation unit 11c calculates the norm of section p. The norm of section p is obtained as norm ← w^(1/2) and is stored in the corresponding location of the document norm table storage buffer 11l (step S509). To calculate the norm of the next section, the word ID q and the norm work variable w are then cleared (step S510). When the norms of all sections have been calculated, control moves to step S512 (step S511). In step S512, the value of d is incremented by 1 to advance to the next document ID. When the above norm calculation has been completed for all documents, control moves to step S514 (step S513). Finally, the contents of the document norm table storage buffer 11l are output to the external storage device 5 as the document norm table (step S514).
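The loop of steps S505 to S513 amounts to a square root of a sum of squares per section. A minimal Python sketch follows; the representation of a word frequency table as a list of {word ID: frequency} dictionaries and the function names `section_norms` and `build_document_norm_table` are assumptions made for illustration.

```python
import math
from typing import Dict, List


def section_norms(freq_table: List[Dict[int, int]]) -> List[float]:
    """Return the norm of each section of one word frequency table."""
    norms: List[float] = []
    for section in freq_table:        # outer loop over sections p (S511)
        w = 0                         # work variable, cleared per section (S510)
        for f in section.values():    # loop over word IDs q (S505 to S508)
            w += f * f                # accumulate squared frequency (S506)
        norms.append(math.sqrt(w))    # norm <- w^(1/2) (S509)
    return norms


def build_document_norm_table(
    tables: Dict[str, List[Dict[int, int]]],
) -> Dict[str, List[float]]:
    """Map each document ID to its per-section norms (S512 to S514)."""
    return {doc_id: section_norms(t) for doc_id, t in tables.items()}
```

Because the norms depend only on a single document, they can be computed once and reused for every pair in which that document appears.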

【００６５】Next, the word correspondence table creation processing of step S8 will be described. FIG. 23 is a flowchart showing the details of this processing.

【００６６】When the document norm table has been stored in the external storage device 5, the main processing unit 11a transfers control to the word correspondence table creation unit 11d. The word correspondence table creation unit 11d receives the values of da, the ID of the search-target word frequency table, and db, the ID of the searched-document word frequency table, specified in step S4. These values are the document IDs of the word frequency tables referenced when the word correspondence table is created (step S801).

【００６７】Next, as shown in FIG. 9, the word correspondence table creation unit 11d creates in the word correspondence table storage buffer 11m a word correspondence table with as many entries as there are words used in the word frequency table of the search target document, and fills every position information field for word IDs in the searched document with an initial value (-1 in the example) (step S802).

【００６８】Subsequently, the word correspondence table creation unit 11d refers to the word frequency table with search target document ID = da in the search-target word frequency table storage buffer 11j and to the word frequency table with searched document ID = db in the searched-document word frequency table storage buffer 11k. The words used in the search-target word frequency table are matched against those used in the searched-document word frequency table. If a word in the search-target word frequency table is also contained in the searched-document word frequency table, its position in the searched-document word frequency table is written to the word correspondence table in the word correspondence table storage buffer 11m. If it is not contained in the searched-document word frequency table, an arbitrary value meaning "no reference" (-1 in the example) is written to the word correspondence table (step S803).

【００６９】When either the words of the search-target word frequency table or the words of the searched-document word frequency table have been exhausted, the word correspondence table creation unit 11d advances control to step S805. Otherwise, control returns to step S803 so that step S803 is repeated (step S804). If the search-target word frequency table contains more words than the searched-document word frequency table, control moves to step S806; otherwise control moves to step S9 (step S805).

【００７０】If the search-target word frequency table contains more words than the searched-document word frequency table, the word correspondence table creation unit 11d writes the arbitrary value meaning "no reference", as in step S803, to the entries that are still blank in the word correspondence table in the word correspondence table storage buffer 11m (step S806).
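The outcome of steps S802 to S806 is a table that gives, for each word of the search target document, the position of the same word in the searched document, or -1 when there is none. The Python sketch below produces such a table with a dictionary lookup instead of the step-by-step matching loop of the flowchart, so it illustrates the result rather than the exact procedure. The name `build_word_correspondence`, the representation of each table's words as a list of word IDs, and the assumption that word IDs are unique within a list are all illustrative.

```python
from typing import List

NO_REFERENCE = -1  # the "no reference" marker used in the example


def build_word_correspondence(
    search_words: List[int],
    searched_words: List[int],
) -> List[int]:
    """For each word ID of the search target document, return its position
    in the searched document's word list, or NO_REFERENCE if absent."""
    position = {word_id: i for i, word_id in enumerate(searched_words)}
    return [position.get(word_id, NO_REFERENCE) for word_id in search_words]
```

Every entry is filled in a single pass, so no separate step is needed to mark leftover blank entries as "no reference".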

【００７１】Next, the similarity calculation processing of step S9 will be described. FIGS. 24 and 25 are flowcharts showing the details of this similarity calculation processing.

【００７２】Let s be the document similarity, sp the similarity of each section, i the number of sections, fap the vector whose elements are the word occurrence frequencies of each section of the search target document, and fbp the vector whose elements are the word occurrence frequencies of each section of the searched document. The document similarity s is then expressed by

【数２】(Equation 2)
$$\mathit{sp} = \frac{\mathit{fap} \cdot \mathit{fbp}}{\|\mathit{fap}\|\,\|\mathit{fbp}\|}, \qquad s = \frac{1}{i}\sum_{p} \mathit{sp}$$

where the sum runs over all i sections.

【００７３】When the word correspondence table has been stored in the word correspondence table storage buffer 11m, the main processing unit 11a transfers control to the similarity calculation unit 11e. The similarity calculation unit 11e prepares the following work variables in the work variable area 11p: the section being referenced (p), the total number of sections (pt), the word ID being referenced (q), the total number of words (qt), the word occurrence frequency in the search target document (fa), the word occurrence frequency in the searched document (fb), a work value (w), the norm work variable for the search target document (norma), the norm work variable for the searched document (normb), and the document similarity (s). It also prepares two arrays of size pt, the inner product innprd[pt] and the per-section similarity sp[pt], and assigns initial values to all of these variables (step S901).

【００７４】Next, the similarity calculation unit 11e receives the values of da, the document ID of the search-target word frequency table, and db, the document ID of the searched-document word frequency table, specified in step S4. These document IDs are the same as those received in step S8 (step S902).

【００７５】Subsequently, the similarity calculation unit 11e refers to the word correspondence table created in the word correspondence table storage buffer 11m in step S8 and reads the position information in the search-target word frequency table for section p and word q. If this position information is "no reference" (-1), control moves to step S904; otherwise control moves to step S905 (step S903).

【００７６】If the reference in step S903 is "no reference", the similarity calculation unit 11e assigns 0 to the variable fa, which holds the word occurrence frequency of the search target document for section p and word q (step S904). If the reference in step S903 holds any value other than "no reference", the similarity calculation unit 11e assigns the word occurrence frequency of the search target document for section p and word q to fa, and the word occurrence frequency of the searched document to fb (step S905).

【００７７】Once the word occurrence frequency fa of the search target document and the word occurrence frequency fb of the searched document have been set, the similarity calculation unit 11e calculates the inner product for each section. The inner product innprd[p] of a section is given by innprd[p] ← innprd[p] + fa × fb (step S906).

【００７８】If the inner products of all sections of the word frequency table have been calculated, the similarity calculation unit 11e moves control to step S908; if the calculation is still in progress, control moves to step S903 (step S907). Likewise, if the calculation has been completed for all words of the word frequency table, control moves to step S909; if it is still in progress, control moves to step S903 (step S908).

【００７９】The steps so far calculate the numerator of the per-section similarity. The denominator is calculated next.

【００８０】First, the similarity calculation unit 11e refers to the document norm table calculated in step S5 (step S909) and assigns the norm of section p of document ID = da to the variable norma and the norm of section p of document ID = db to the variable normb (step S910).

【００８１】Next, the similarity calculation unit 11e obtains the similarity sp[p] of section p. From innprd[p] calculated in step S906 and norma and normb of step S910, the similarity of section p is obtained as sp[p] ← innprd[p] / (norma × normb) (step S911).

【００８２】Subsequently, the similarity calculation unit 11e checks the section p. If the similarities of all sections have not yet been obtained, control returns to step S901; once the similarities of all sections have been obtained, control advances to step S913 (step S912).

【００８３】The similarity of the documents is now obtained. With pt the total number of sections, p a section, and sp[p] the similarity of section p, the document similarity s is obtained by

【数３】(Equation 3)
$$s = \frac{1}{\mathit{pt}}\sum_{p} \mathit{sp}[p]$$

where the sum runs over all pt sections (step S913).

【００８４】Finally, the similarity calculation unit 11e outputs the document similarity s obtained in step S913, together with the document ID of the searched document, to the external storage device 5 as a similarity calculation result (step S914).
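Steps S903 to S913 can be condensed into a per-section cosine similarity followed by an average. The Python sketch below is illustrative and makes several assumptions: each word frequency table is a list of {word ID: frequency} dictionaries, sections of the two documents are compared pairwise by index, matching words are found by dictionary key rather than through the word correspondence table, norms are recomputed inline rather than read from the document norm table, and a section with a zero norm is given similarity 0. The function names are hypothetical.

```python
import math
from typing import Dict, List


def section_similarity(fa: Dict[int, int], fb: Dict[int, int]) -> float:
    """Similarity of one section pair: inner product over norm product."""
    # Numerator (S903 to S906): words absent from fb contribute 0.
    inner = sum(f * fb.get(word_id, 0) for word_id, f in fa.items())
    # Denominator (S909, S910): norms of the two sections.
    norm_a = math.sqrt(sum(f * f for f in fa.values()))
    norm_b = math.sqrt(sum(f * f for f in fb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                     # assumption: avoid division by zero
    return inner / (norm_a * norm_b)   # S911


def document_similarity(
    table_a: List[Dict[int, int]],
    table_b: List[Dict[int, int]],
) -> float:
    """Average of the per-section similarities (S913)."""
    sims = [section_similarity(a, b) for a, b in zip(table_a, table_b)]
    return sum(sims) / len(sims) if sims else 0.0
```

Averaging is only one possible way to combine the per-section values; the description presents it as an example.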

【００８５】Next, the output editing processing of step S13 will be described. FIG. 26 is a flowchart showing the details of this output editing processing.

【００８６】The output editing unit 11f outputs the similarity calculation results, which are stored in the external storage device 5 split into a plurality of parts, as a single output editing result. The role of the output editing unit 11f is explained here.

【００８７】In this method, as shown in FIG. 14(b), the similarity calculation results between the search target document a1 and the searched documents b1 to bn are held in two files, o11 and o12. Processing is therefore needed to merge the split similarity calculation results into a single output editing result.

【００８８】First, when all similarity calculation results have been stored in the external storage device 5, the main processing unit 11a transfers control to the output editing unit 11f. The output editing unit 11f reads a split similarity calculation result from the external storage device 5 and writes it to the similarity calculation result storage buffer 11n (step S131).

【００８９】Next, the output editing unit 11f checks the division number r attached to each similarity calculation result (when the similarity calculation result is split into two files, the division number of the first result is 1 and that of the next result is 2). With dr the total number of divisions, control returns to step S131 while r < dr; otherwise control advances to step S133 (step S132). In the example of FIG. 14, the similarity calculation results are read in the order o11 → o12.

【００９０】When all the split similarity calculation results have been stored in the similarity calculation result storage buffer 11n, the output editing unit 11f sorts them with the document similarity as the key (step S133), writes the output editing result (FIG. 11) to the output editing result storage buffer 11o, and then outputs it to the external storage device 5 (step S134).

【００９１】In the present embodiment, when the similarity calculation result corresponding to o11 in FIG. 14(c) is "a1 #1" in FIG. 10 and the similarity calculation result corresponding to o12 is "a1 #2" in FIG. 10, an output editing result such as that shown in FIG. 11 is obtained. That is, the items are rearranged in descending order of similarity.
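Merging the split results and sorting them, as in steps S131 to S133, can be sketched in a few lines of Python. The representation of each split result as a list of (searched document ID, similarity) pairs and the function name `merge_and_sort` are assumptions for illustration, and the similarity values in the usage example are made up.

```python
from typing import List, Tuple

Row = Tuple[str, float]  # (searched document ID, similarity)


def merge_and_sort(parts: List[List[Row]]) -> List[Row]:
    """Concatenate the split results in division order, then sort them
    in descending order of similarity."""
    merged = [row for part in parts for row in part]   # S131, S132
    return sorted(merged, key=lambda row: row[1], reverse=True)  # S133


# Usage: two split results for one search target document.
part_1 = [("b1", 0.2), ("b2", 0.9)]
part_2 = [("b3", 0.5)]
ranking = merge_and_sort([part_1, part_2])
```

Python's `sorted` is stable, so rows with equal similarity keep the order in which they were read.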

[0092] Further, the output editing unit 11f deletes from the external storage device 5 the similarity calculation results read in steps S131 and S132 (step S135). Finally, it checks whether all the similarity calculation results have undergone output editing. If output editing is still in progress, control returns to step S131; otherwise control proceeds to S14 (step S136).

[0093] In this way, the similar document search apparatus of this embodiment enables a similar document search that takes into account the appearance frequency of specific words (the words in the word extraction table) in the document data and is therefore more reliable than conventional methods. It can also perform a similar document search over a plurality of search target documents and a plurality of documents to be searched at the same time, with high efficiency and high accuracy.

[0094] In addition, a word frequency table and a document norm table are created in advance for each document data item stored in the external storage device 5, and once the search target documents and the documents to be searched are designated, the similarity between document data is calculated by referring to the word frequency table and the document norm table of each document. As a result, when similarities between a plurality of search target documents and a plurality of documents to be searched are calculated in succession, that is, when the similarity between one search target document and each document to be searched is calculated and then the similarity between the next search target document and each document to be searched is calculated, the word frequency tables and the norms need not be computed again, and high-speed processing becomes possible.
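
A minimal Python sketch of this precomputation is given below, under stated assumptions: documents are tokenized by whitespace (real Japanese text would need morphological analysis), the norm is the Euclidean norm of the frequency vector, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_frequency_table(text, condition_words):
    """Count how often each preset search condition word occurs in text."""
    # Assumption: whitespace tokenization and lower-cased condition words.
    counts = Counter(text.lower().split())
    return {w: counts[w] for w in condition_words if counts[w]}


def build_tables(documents, condition_words):
    """Build the word frequency tables and the document norm table once."""
    freq_tables = {
        doc_id: word_frequency_table(text, condition_words)
        for doc_id, text in documents.items()
    }
    # Assumption: Euclidean norm of the vector of appearance frequencies.
    norm_table = {
        doc_id: math.sqrt(sum(f * f for f in table.values()))
        for doc_id, table in freq_tables.items()
    }
    return freq_tables, norm_table
```

Because both tables are keyed by document, later similarity calculations can look them up instead of rescanning the document text.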

[0095] Further, in this embodiment, when the similarity between a search target document and a document to be searched is calculated, the appearance frequency information for the words common to the word frequency table of the search target document and that of the document to be searched, which is needed to compute the inner product of the vector data of the two word frequency tables, is obtained uniquely by referring to the word correspondence table. Processing can therefore be sped up further.
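
The role of the word correspondence table can be sketched in Python as follows. This is an illustrative reading, not the specification's procedure: the table is modeled as a list of index pairs for common words, and the similarity is assumed to be the inner product divided by the product of the two precomputed norms (cosine similarity). All names are hypothetical.

```python
def build_correspondence(words_a, words_b):
    """Pair up the positions of words registered in both frequency tables."""
    position_in_b = {word: j for j, word in enumerate(words_b)}
    return [(i, position_in_b[w]) for i, w in enumerate(words_a)
            if w in position_in_b]


def similarity(freq_a, freq_b, correspondence, norm_a, norm_b):
    """Assumed cosine similarity using precomputed norms."""
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Only common words contribute to the inner product, so the
    # correspondence table gives every term that is needed.
    inner = sum(freq_a[i] * freq_b[j] for i, j in correspondence)
    return inner / (norm_a * norm_b)
```

With the index pairs in hand, the inner product needs no word lookup at calculation time, which is the speed-up the paragraph above describes.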

[0096] Further, in this embodiment, as described with reference to FIG. 14, the word frequency tables of the search target documents and of the documents to be searched are loaded from the external storage device 5 into memory in groups of several tables, and the similarity between search target documents and documents to be searched is calculated for every possible combination. The word frequency tables of the next group are then loaded into memory (overwriting the previous ones) and the similarity is calculated in the same way. This reduces the number of accesses to the external storage device 5.
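
A rough Python sketch of this group-wise loading follows. It assumes that both the search target tables and the tables to be searched are split into at most v groups, which is an interpretation chosen to match the access counts given in this description; `load` and `score` are hypothetical callbacks standing in for the read from external storage and the similarity calculation.

```python
def split(items, v):
    """Split items into at most v consecutive groups."""
    size = max(1, -(-len(items) // v))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def blockwise_similarity(query_ids, target_ids, v, load, score):
    """Compute all pair similarities while loading tables group by group."""
    results = {}
    for query_group in split(query_ids, v):
        # Each search target table is loaded exactly once.
        query_tables = {q: load(q) for q in query_group}
        for target_group in split(target_ids, v):
            # Tables to be searched are reloaded once per query group.
            target_tables = {t: load(t) for t in target_group}
            for q, q_table in query_tables.items():
                for t, t_table in target_tables.items():
                    results[(q, t)] = score(q_table, t_table)
    return results
```

With m search target tables, n tables to be searched, and group counts that divide evenly, this sketch calls `load` m + nv times rather than once per pair.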

[0097] For example, let m be the number of word frequency tables of search target documents, n the number of word frequency tables of documents to be searched, v the number of divisions, and let the number of sections be 1. In the present method, nv + m accesses occur when reading the word frequency tables, vm when writing the similarity calculation results, vm when reading the similarity calculation results during output editing, and m when writing the output editing results, for a total of 2m(v + 1) + nv accesses. If the number of divisions v is treated as a constant, this can be expressed in order notation as O(m + n).

[0098] As a comparative example, consider the method shown in FIG. 17, in which the word frequency table of a search target document and that of a document to be searched are loaded into memory one document at a time and the similarity is calculated. In this case, the number of accesses to the external storage device 5 is nm when reading the word frequency tables and m when writing the similarity calculation results, for a total of m(n + 1). In order notation this is O(mn), which shows that the present method requires far fewer accesses to the external storage device 5.

[0099] In concrete terms, when the number of word frequency tables of search target documents is 100, the number of word frequency tables of documents to be searched is 100, and the number of divisions is 2, 800 file accesses occur in the present method, whereas 10,100 file accesses occur in the method of the comparative example.
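
The two access counts can be checked with a short Python calculation. The function names are hypothetical; the formulas are the ones stated in the description, with the number of sections taken as 1.

```python
def accesses_grouped(m, n, v):
    """Accesses for the present method: 2m(v + 1) + nv in total."""
    table_reads = n * v + m      # reading the word frequency tables
    result_writes = v * m        # writing the similarity calculation results
    result_reads = v * m         # reading them back during output editing
    output_writes = m            # writing the output editing results
    return table_reads + result_writes + result_reads + output_writes


def accesses_one_by_one(m, n):
    """Accesses for the comparative method: m(n + 1) in total."""
    return n * m + m             # nm table reads plus m result writes


print(accesses_grouped(100, 100, 2))   # 800
print(accesses_one_by_one(100, 100))   # 10100
```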

[0100] Furthermore, in this embodiment, the word frequency table is structured as a plurality of sections, so that a similarity can be obtained for each section, such as the document title, the summary, and the body, and more accurate similarity search results can be obtained.
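
Section-wise similarity might be sketched in Python as below. The sketch assumes cosine similarity per section and a plain average over the sections present in both documents (the abstract mentions an average similarity value); the section names and function names are hypothetical.

```python
import math


def cosine(freq_a, freq_b):
    """Cosine similarity of two word-frequency dicts (word ID -> count)."""
    inner = sum(f * freq_b[w] for w, f in freq_a.items() if w in freq_b)
    norm_a = math.sqrt(sum(f * f for f in freq_a.values()))
    norm_b = math.sqrt(sum(f * f for f in freq_b.values()))
    return inner / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_similarity(sections_a, sections_b):
    """Return per-section similarities and their average."""
    shared = [name for name in sections_a if name in sections_b]
    per_section = {name: cosine(sections_a[name], sections_b[name])
                   for name in shared}
    # Assumption: the document similarity is the unweighted mean.
    overall = (sum(per_section.values()) / len(per_section)
               if per_section else 0.0)
    return per_section, overall
```

Keeping the per-section values alongside the average lets a caller weight the title or body differently if desired.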

[0101]

[Effects of the Invention] As described above, according to the inventions of claims 1 and 3, a similar document search that takes into account the appearance frequency of the search condition words in the document data becomes possible, and a similar document search over a plurality of search target documents and a plurality of documents to be searched at the same time can be performed with high efficiency and high accuracy.

[0102] In the inventions of claims 1 and 3, a word frequency table and a document norm table are created in advance for each document data item stored in the storage means, and once the document data of the search target and of the target to be searched are designated, the similarity between the document data is obtained from the word frequency tables and document norm tables of the corresponding document data. Therefore, even when similar document searches over a plurality of search target documents and a plurality of documents to be searched are performed in succession, the word frequency tables and the norms need not be computed again, and high-speed processing becomes possible.

[0103] Further, according to the inventions of claims 2 and 4, when the similarity between document data is calculated, the appearance frequency information for the words common to the word frequency table of the search target document and that of the document to be searched, which is needed to compute the inner product of the vector data of the word frequency tables, is obtained uniquely by referring to the word correspondence table. Processing can therefore be sped up further.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the hardware configuration of a similar document search apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing the configuration of the memory in FIG. 1.

FIG. 3 is a diagram showing the configuration of the program section and the buffer section in the memory of FIG. 2.

FIG. 4 is a diagram showing the structure of document data.

FIG. 5 is a diagram showing the structure of the word ID-word table.

FIG. 6 is a diagram showing the word frequency table of a search target document.

FIG. 7 is a diagram showing the word frequency table of a document to be searched.

FIG. 8 is a diagram showing the document norm table.

FIG. 9 is a diagram showing the word correspondence table.

FIG. 10 is a diagram showing a similarity calculation result.

FIG. 11 is a diagram showing the output editing result of the similarity calculation results.

FIG. 12 is a diagram showing the storage format of document data and word frequency tables in the external storage device.

FIG. 13 is a diagram showing the word extraction table.

FIG. 14 is a diagram for specifically explaining the processing from step S6 to step S12 of the main processing shown in FIG. 16.

FIG. 15 is a flowchart showing the overall procedure of the word extraction processing.

FIG. 16 is a flowchart showing the overall procedure of the main processing.

FIG. 17 is a diagram showing a comparative example for the processing from step S6 to step S12 of the main processing shown in FIG. 16.

FIG. 18 is a diagram showing a document list screen presenting similar document search results.

FIG. 19 is a diagram showing a document screen presenting a similar document search result.

FIG. 20 is a flowchart showing details of the word extraction processing.

FIG. 21 is a flowchart showing details of the initialization processing.

FIG. 22 is a flowchart showing details of the norm calculation processing.

FIG. 23 is a flowchart showing details of the word correspondence table creation processing.

FIG. 24 is a flowchart showing details of the similarity calculation processing.

FIG. 25 is a flowchart, continuing from FIG. 24, showing details of the similarity calculation processing.

FIG. 26 is a flowchart showing details of the output editing processing.

FIG. 27 is a diagram for explaining a conventional similar document search method.

[Explanation of symbols]

1 ... Input device
2 ... Display device
3 ... Control device
4 ... Memory
5 ... External storage device
10a ... Word extraction unit
10b ... Input document data storage buffer
10c ... Word ID-word table storage buffer
10d ... Word frequency table storage buffer
10e ... Word extraction table storage
11a ... Main processing unit
11b ... Initialization unit
11c ... Norm calculation unit
11d ... Word correspondence table creation unit
11e ... Similarity calculation unit
11f ... Output editing unit
11g ... Document list display unit
11h ... Document selection unit
11i ... Document content display unit
11j ... Search target / word frequency table storage buffer
11k ... Target to be searched / word frequency table storage buffer
11l ... Document norm table storage buffer
11m ... Word correspondence table storage buffer
11n ... Similarity calculation result storage buffer
11o ... Output editing result storage buffer

Continued from the front page
(72) Inventor: Naohide Kubota, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Yukio Nakamoto, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takuya Nishina, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo

Claims

[Claims]

1. A similar document search device comprising: storage means for storing a plurality of document data; word frequency table creating means for creating a word frequency table by obtaining, for each document data stored in the storage means, the appearance frequency of each preset search condition word; document norm table creating means for creating a document norm table by calculating, for each document data stored in the storage means, the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the word frequency table created by the word frequency table creating means; specifying means for specifying, from the document data stored in the storage means, each document data of a search target and of a target to be searched; and similarity calculating means for calculating the similarity between the respective document data specified by the specifying means based on the word frequency table and the document norm table.

2. A similar document search device comprising: storage means for storing a plurality of document data; word frequency table creating means for creating a word frequency table by obtaining, for each document data stored in the storage means, the appearance frequency of each preset search condition word; document norm table creating means for creating a document norm table by calculating, for each document data stored in the storage means, the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the word frequency table created by the word frequency table creating means; specifying means for specifying, from the document data stored in the storage means, each document data of a search target and of a target to be searched; word correspondence table creating means for creating a word correspondence table indicating the registered positional relationship of common words between the word frequency tables created by the word frequency table creating means for the respective document data; and similarity calculating means for calculating the similarity between the respective document data specified by the specifying means based on the word frequency table, the document norm table, and the word correspondence table.

3. A similar document search method comprising: a step of creating a word frequency table by obtaining, for each document data stored in a document database, the appearance frequency of each preset search condition word; a step of creating a document norm table by calculating, for each document data stored in the document database, the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the created word frequency table; a step of designating, from the document data stored in the document database, each document data of a search target and of a target to be searched; and a step of calculating the similarity between the designated document data based on the word frequency table and the document norm table.

4. A similar document search method comprising: a step of creating a word frequency table by obtaining, for each document data stored in a document database, the appearance frequency of each preset search condition word; a step of creating a document norm table by calculating, for each document data stored in the document database, the norm of a one-dimensional vector whose elements are the appearance frequencies of the search condition words in the created word frequency table; a step of designating, from the document data stored in the document database, each document data of a search target and of a target to be searched; a step of creating a word correspondence table indicating the registered positional relationship of common words between the word frequency tables created for the respective document data; and a step of calculating the similarity between the designated document data based on the word frequency table, the document norm table, and the word correspondence table.