JPH07105237A

JPH07105237A - Method and device for index generation, and document retrieval device

Info

Publication number: JPH07105237A
Application number: JP5253032A
Authority: JP
Inventors: Yuji Sugano; 野祐司菅
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-10-08
Filing date: 1993-10-08
Publication date: 1995-04-21
Anticipated expiration: 2015-08-28
Also published as: JP3081093B2

Abstract

PURPOSE:To enable high-speed all sentence retrieval under complicated retrieval conditions. CONSTITUTION:The index generating device has a character appearance frequency extent calculating means 104, a two-character successive appearance extent calculating means 105, and a three-character successive appearance extent calculating means 106 which analyze the extent of appearance of a character or character string of high frequency, and also has means 107-109 which group characters of low frequency, two successive characters, and three successive characters respectively according to the respective extents of appearance frequencies; and an index type for generating small-sized precise index data and index data are generated at a high speed according to the type. The document retrieval device has the index data which are thus generated, an index collation condition generating means, an index collating means, and a whole scanning character string collating means, and can perform high-speed retrieving operation utilizing the index data even under complicated retrieval conditions of a regular expression, etc.

Description

Detailed Description of the Invention

【０００１】[0001]

【産業上の利用分野】本発明は、電子計算機を応用した
文書検索システムや文書編集システムにおける文書中か
ら文字列等を検索するための索引の作成方法およびその
装置と文書検索装置に関するものである。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a method for creating an index for searching a character string or the like from a document in a document search system or a document editing system to which an electronic computer is applied, and an apparatus and a document search apparatus. .

【０００２】[0002]

【従来の技術】近年、ワードプロセッサやパーソナルコ
ンピューターの普及、コンピュータの記憶装置の容量の
増大、コンピュータによる文字認識の実用化等に伴い、
文書中の全ての文字情報を蓄積した全文データベースが
多くなってきた。このため、大量の文字情報を蓄積し、
必要に応じて文書情報を検索する全文データベース検索
システムに対する関心が高まってきている。2. Description of the Related Art In recent years, with the spread of word processors and personal computers, the increase in storage capacity of computers, and the practical use of character recognition by computers,
There is a growing number of full-text databases that store all textual information in documents. Therefore, a large amount of character information is accumulated,
There is increasing interest in a full-text database search system that searches document information as needed.

【０００３】従来の文書データベースシステムでは、文
書を検索する際の鍵として、文書毎に人手により付与さ
れたキーワードを利用するキーワード検索方式が一般的
であった。しかし、キーワード付け作業が蓄積文書の増
加に間に合わない、時間が経過するとキーワードが陳腐
化する、キーワード付けを行った者と検索する者とのキ
ーワードの解釈の相違により検索もれが生じる、などの
問題点があった。このような背景から、近年、全文検索
（フルテキストサーチ）と呼ばれる文書検索方式が注目
されている。In a conventional document database system, a keyword search method is generally used in which a keyword manually assigned to each document is used as a key for searching a document. However, the keyword addition work cannot keep up with the increase in accumulated documents, the keywords become obsolete over time, and the search omission occurs due to the difference in the interpretation of the keyword between the person who added the keyword and the person who searches. There was a problem. From such a background, a document search method called a full-text search (full-text search) has recently been attracting attention.

【０００４】全文検索は、文書データのほかには補助的
な情報を持たずに、検索毎に文書データを全文走査する
「フルテキストスキャン」方式と、検索に先だって、文
書データ中に出現する文字あるいは文字列の情報を高速
に取り出せるような索引情報を自動的に作成しておい
て、検索時にこの索引を検索する方式の２種類に大別さ
れる。The full-text search is a "full-text scan" method in which full-text scanning is performed on the document data for each search without any auxiliary information other than the document data, and a character appearing in the document data prior to the search. Alternatively, it is roughly classified into two types, that is, a method of automatically creating index information that can extract character string information at a high speed and searching this index at the time of searching.

【０００５】このうち、フルテキストスキャン方式は、
原文書以外の情報を用いないので、記憶容量が少なくて
済むとともに文書データの更新直後でも即座に検索でき
る点、および正規表現等の文字列パターンや論理条件を
含む複雑な検索条件の場合や検索結果が多い場合でも、
検索時間がほぼ一定である点が長所であるが、文書デー
タの全てを走査するため、索引方式に比べて検索速度が
遅いという問題が指摘されている。Of these, the full text scanning method is
Since information other than the original document is not used, the storage capacity is small and it is possible to search immediately even after updating the document data, and in the case of complicated search conditions including character string patterns such as regular expressions and logical conditions Even if there are many results,
Although the advantage is that the search time is almost constant, it has been pointed out that the search speed is slower than that of the index method because all document data is scanned.

【０００６】一方、索引方式は、一般にフルテキストス
キャン方式よりも検索速度が速く、索引の作成方法によ
っては、検索速度が文書量にほとんど依存しないという
利点があるが、索引情報の容量が大きいこと、索引を作
成する時間が長いこと、検索条件が複雑な場合や検索結
果が多い場合に検索速度が低下すること等の問題が指摘
されている。On the other hand, the index method generally has a faster search speed than the full-text scan method, and depending on the method of creating the index, the search speed has little advantage that it depends on the amount of documents, but the index information has a large capacity. It has been pointed out that problems such as a long index creation time and a low search speed when the search conditions are complicated or the number of search results are large.

【０００７】このような従来の全文検索ための文書検索
方法と索引作成方法とその特徴は、「ＡｃｃｅｓｓＭ
ｅｔｈｏｄｓｆｏｒＴｅｘｔ］（Ｏｈｒｉｓｔｏｓ
Ｆａｌｏｕｔｓｏｓ，ＣｏｍｐｕｔｉｎｇＳｕｒｖ
ｅｙｓ，Ｖｏｌ．１７，Ｎｏ．１，Ｍａｒｃｈ１９８
５）等の論文や、「テキスト検索プロセッサ」（高橋恒
介著、電子情報通信学会刊）等の成書に詳細な説明がな
されている。The conventional document search method and index creation method for full-text search and their characteristics are described in "Access M".
methods for Text] (Ohristos
Faloutos, Computing Surv
eys, Vol. 17, No. 1, March 198
Detailed explanations are given in papers such as 5) and in textbooks such as "Text Search Processor" (written by Tsunesuke Takahashi, published by The Institute of Electronics, Information and Communication Engineers).

【０００８】[0008]

【発明が解決しようとする課題】しかしながら、上記の
論文、成書で紹介されている従来の方法では、索引を用
いないと検索速度が上がらず、一方索引を用いると、索
引の作成・更新時間がかかる上に、索引データの容量が
大きくなり、正規表現などの複雑な文字列パターンでの
検索にも時間がかかるという課題があった。However, in the conventional methods introduced in the above papers and books, the search speed cannot be improved without using the index. On the other hand, when the index is used, the index creation / update time is increased. In addition to the above, there is a problem that the amount of index data becomes large and it takes time to search for a complicated character string pattern such as a regular expression.

【０００９】本発明は、上記の従来技術の課題を解決す
るもので、作成・更新時間が短く、容量が小さく、正規
表現などの複雑な文字列パターンでの近似検索も高速で
行なうことのできる索引作成方法とその装置および作成
された索引データとフルテキストスキャンとを組み合わ
せた検索速度の速い文書検索装置を提供することを目的
とする。The present invention solves the above-mentioned problems of the prior art, and has a short creation / update time, a small capacity, and an approximate search with a complicated character string pattern such as a regular expression can be performed at high speed. An object of the present invention is to provide an index creating method and its device, and a document searching device which combines the created index data and full-text scanning and has a high search speed.

【００１０】[0010]

【課題を解決するための手段】上記目的を達成するため
に、本発明による索引作成方法は、検索対象文書データ
に関する索引データを作成する際に、まずサンプル文書
データの文字および文字列の出現を統計的に調べて索引
データを作成する際の共通情報となる索引型式データを
作成し、この索引型式データの型式に従って索引データ
を作成するようにしたものである。In order to achieve the above object, the index creating method according to the present invention first detects the appearance of characters and character strings in the sample document data when creating the index data related to the search target document data. The index type data, which is common information when statistically checking and creating the index data, is created, and the index data is created according to the type of the index type data.

【００１１】また、本発明によると索引作成装置は、サ
ンプル文書データ中のある１文字の出現の度合を統計的
に調べる文字出現頻度算定手段と、前回調べた文字の出
現の度合が予め定められた値よりも高い場合に、前回調
べた文字の全てを含むＮ文字（Ｎは２、３、４、・・・
の自然数）の文字列についての出現の度合を統計的に調
べる複数のＮ文字連続出現頻度算定手段と、サンプル文
書データ中の文字または文字列の出現の度合に応じて文
字出現頻度算定手段および複数のＮ文字連続出現頻度算
定手段の出力から検索対象文書データに関する索引デー
タを作成する際の共通情報となる索引型式データを作成
する索引型式出力手段とを備えたものである。Further, according to the present invention, the index creating apparatus has a character appearance frequency calculating means for statistically checking the degree of appearance of one character in the sample document data, and the degree of appearance of the previously examined character. If the value is higher than the specified value, N characters (N is 2, 3, 4, ...
Natural number), a plurality of N character consecutive appearance frequency calculation means for statistically checking the appearance degree of a character string, and a character appearance frequency calculation means and a plurality of character appearance frequency calculation means according to the appearance degree of a character or a character string in sample document data. And an index type output unit for creating index type data which is common information when creating index data relating to the search target document data from the output of the N character consecutive appearance frequency calculation unit.

【００１２】さらに、本発明による索引作成装置は、サ
ンプル文書データ中の文字または文字列をその出現の度
合に応じてグループ化する複数のグループ化手段を備
え、索引型式出力手段が、各グループ化手段から出力さ
れたグループ化情報を基に各グループの通し番号と所属
する文字または文字列との対応表を出力するようにした
ものである。Further, the index creating device according to the present invention comprises a plurality of grouping means for grouping the characters or character strings in the sample document data according to the degree of their appearance, and the index type output means groups each group. Based on the grouping information output from the means, a correspondence table of serial numbers of each group and the characters or character strings to which they belong is output.

【００１３】さらに、本発明による索引作成装置は、検
索対象文書データに関する索引データを作成する際に用
いる検索文字数を上記構成の索引作成装置から出力され
た索引型式データに従って決定する文字連続数算定手段
と、文字連続数算定手段により決定された文字数と上記
構成の索引作成装置から出力された索引型式データとか
ら対応するグループ番号を算定するグループ番号算定手
段と、グループ番号算定手段から出力されたグループ番
号からそれぞれの文書レコードの索引データを作成する
索引情報蓄積出力手段とを備えたものである。Further, the index creating device according to the present invention is a character continuous number calculating means for determining the number of search characters used when creating the index data relating to the document data to be searched according to the index type data output from the index creating device having the above-mentioned configuration. And a group number calculation means for calculating the corresponding group number from the number of characters determined by the character continuation number calculation means and the index type data output from the index creation device having the above configuration, and a group output from the group number calculation means. Index information storage / output means for creating index data of each document record from the number.

【００１４】また、本発明による文書検索装置は、文字
列パターンを含む検索条件を入力する検索入力手段と、
検索条件から上記構成の索引作成装置から出力された索
引データを照合するための文字または文字列のＡＮＤ／
ＯＲ木を作成する索引照合条件作成手段と、索引データ
と索引照合条件作成手段が作成したＡＮＤ／ＯＲ木との
照合を行なう索引照合手段と、照合が成功した場合に、
検索対象文書データの対応する部分を検索条件入力手段
から入力された文字列パターンを含む検索条件と照合
し、照合の成功した部分を最終的な検索結果として出力
する全文走査文字列照合手段とを備えたものである。Further, the document search device according to the present invention comprises search input means for inputting search conditions including a character string pattern,
AND / of characters or character strings for collating the index data output from the index creating device having the above-mentioned configuration from the search condition
An index matching condition creating means for creating an OR tree, an index matching means for matching the index data with the AND / OR tree created by the index matching condition creating means, and when the matching is successful,
A full-text scanning character string collating unit that collates the corresponding portion of the search target document data with the retrieval condition including the character string pattern input from the retrieval condition input unit and outputs the successfully collated portion as the final retrieval result. Be prepared.

【００１５】[0015]

【作用】本発明は、上記構成によって、検索対象文書デ
ータに関する索引の型式が統計的に検索対象文書データ
に適合した、容量の小さい索引データを、検索対象文書
データの統計的性質を調べることなしに高速に作成し、
また、サンプル文書中における出現の度合が、予め定め
られた値以下すなわち絞り込み率以下である低頻度文字
については、索引型式出力手段が検索対象文書データに
おける１文字の出現を記録するための索引データの型式
を指示し、サンプル文書中における出現の度合が絞り込
み率よりも高い高頻度文字については、高頻度文字に属
するＮ文字、すなわちまず初めに２つの文字からなる２
文字連続のサンプル文書データ中における出現の度合を
２文字連続出現頻度算定手段が統計的に調べ、サンプル
文書データ中における出現の度合が絞り込み率以下であ
る低頻度２文字連続については、索引型式出力手段が検
索対象文書データにおける２文字連続の出現を記録する
ための索引データの型式を指示し、サンプル文書データ
中における出現の度合が絞り込み率よりも高い高頻度２
文字連続については、高頻度２文字連続に属する２つの
２文字連続をそれぞれ初めの２文字、および最後の２文
字に持つ３文字連続のサンプル文書データ中における出
現の度合を３文字連続出現頻度算定手段が統計的に調
べ、索引型式出力手段が、検索対象文書データにおける
３文字連続の出現を記録するための索引データの型式を
指示することによって、文字および文字列の出現の度合
が異なっていても、検索条件によらずに絞り込み率以下
に検索対象文書データを絞り込むことを可能にする索引
データを作成することができる。According to the present invention, with the above configuration, the index data of the search target document data is statistically adapted to the search target document data, and the index data having a small capacity is not checked for the statistical property of the search target document data. To create fast,
Also, for low-frequency characters whose degree of appearance in the sample document is less than or equal to a predetermined value, that is, the narrowing-down rate or less, the index type output means records the occurrence of one character in the search target document data. For a high-frequency character whose degree of appearance in the sample document is higher than the narrowing-down rate, N characters belonging to the high-frequency character, that is, 2 characters consisting of two characters first
The two-character consecutive appearance frequency calculation means statistically examines the degree of appearance in the sample document data of consecutive letters, and the low-frequency two-letter consecutive in which the degree of appearance in the sample document data is equal to or less than the narrowing rate is output as an index type output. The means indicates the type of index data for recording the appearance of two consecutive characters in the search target document data, and the degree of appearance in the sample document data is higher than the narrowing rate.
Regarding the character continuity, the degree of appearance in the sample document data of two consecutive two characters belonging to the high-frequency two consecutive characters in the first two characters and the last two characters in the sample document data is calculated as the three consecutive character appearance frequency. The means statistically examines, and the index type output means indicates the type of index data for recording the appearance of three consecutive characters in the search target document data, so that the degree of appearance of characters and character strings is different. Also, it is possible to create index data that enables the search target document data to be narrowed down to the narrowing rate or less regardless of the search conditions.

【００１６】また、サンプル文書データ中の文字または
文字列の出現の度合を文字列出現頻度算定手段が統計的
に調べ、その後で、グループ化手段が、１つ以上の文字
または文字列からなるグループであって、当該グループ
に属する文字または文字列の少なくとも１種が出現する
度合が予め定めた絞り込み率以下であるグループに、サ
ンプル文書データ中の文字または文字列を振り分け、検
索対象文書データにおいて当該グループに所属するいず
れかの文字あるいは文字列が出現した場合には、「当該
グループに属する文字あるいは文字列のいずれかが出現
した」という情報を記録するための索引データの型式を
索引型式決定手段が決定して索引型式データを作成する
ことによって、多くの種類の低頻度文字がある場合で
も、容量の小さな索引を作成することができる。Further, the character string appearance frequency calculation means statistically checks the degree of appearance of the character or character string in the sample document data, and then the grouping means makes a group consisting of one or more characters or character strings. In addition, the characters or character strings in the sample document data are sorted into groups in which the degree of appearance of at least one of the characters or character strings belonging to the group is less than or equal to a predetermined narrowing rate, and If any character or character string belonging to the group appears, the index data type determining unit determines the index data type for recording the information that "any character or character string belonging to the group has appeared". Decides to create indexed type data, so even if you have many types of infrequent characters, It is possible to create.

【００１７】さらに、文書検索装置においては、利用者
が検索条件入力手段から入力した文字列パターンを含む
検索条件から、索引照合条件作成手段が、索引データを
照合するための文字または文字列のＡＮＤ／ＯＲ木を作
成し、索引照合手段が、索引データと、索引照合条件作
成手段の作成したＡＮＤ／ＯＲ木との照合を行い、照合
が成功したデータの場合には、全文走査文字列照合手段
が、検索対象文書データの対応する部分を検索条件入力
手段から入力された文字列パターンを含む検索条件と完
全に照合し、照合の成功した部分を最終的な検索結果と
することにより、従来はフルテキストスキャン方式でし
か扱えなかった複雑な検索条件の場合でも、索引による
検索速度の高速化を実現することができる。Further, in the document search device, the index matching condition creating means ANDs the characters or character strings for matching the index data from the search condition including the character string pattern inputted by the user from the search condition inputting means. / OR tree is created, and the index matching means matches the index data with the AND / OR tree created by the index matching condition creating means, and if the data is successfully matched, the full text scanning character string matching means. However, by completely matching the corresponding part of the search target document data with the search condition including the character string pattern input from the search condition input means, and making the part that has been successfully matched the final search result, Even in the case of complicated search conditions that can only be handled by the full-text scan method, the search speed can be increased by using the index.

【００１８】[0018]

【Example】

（実施例１）以下、本発明の索引作成方法を実施するた
めの装置について、図面を参照しながら説明する。図１
は本発明の第１の実施例における索引作成装置の構成を
示すブロック図である。図１において、１０１は文書デ
ータを構成する複数の文書レコードを格納したサンプル
文書データである。サンプル文書データは、検索対象文
書データの全部または一部でもよく、検索対象文書デー
タに対し、文字および文字列の出現に関する統計的性質
が類似している他の文書データであってもよい。１０２
はサンプル文書データ１０１中の各文書レコードの位置
を記録したサンプル文書句切りデータ、１０３はサンプ
ル文書句切りデータ１０２の位置情報に従ってサンプル
文書データ１０１から指定された文書レコードを切り出
して、レコード先頭を表す特別な文字＜ＳＴＡＲＴ＞を
文書レコード先頭に付与し、レコード終了を表す特別な
文字＜ＥＮＤ＞を文書レコード末尾に付与した文字列を
出力する文書区切り手段、１０４は文書区切り手段１０
３の出力である文書レコード文字列を受け取ってサンプ
ル文書データ１０１中に出現する各文字の出現の度合
を、「当該文字の出現する文書レコードの文字数の総和
を全文書レコードの文字数の総和で除した値」として算
定する文字出現頻度算定手段、１０５は文書区切り手段
１０３の出力である文書レコード文字列と、文字出現頻
度算定手段１０４の算定結果とを受け取って、サンプル
文書データ１０１中に高頻度で出現する２文字連続の出
現の度合を、「当該２文字連続の出現する文書レコード
の文字数の総和を全文書レコードの文字数の総和で除し
た値」として算定する２文字連続出現頻度算定手段、１
０６は文書区切り手段１０３の出力である文書レコード
文字列と、２文字連続出現頻度算定手段１０５の算定結
果とを受け取って、サンプル文書データ１０１中に高頻
度で出現する３文字連続の出現の度合を、「当該３文字
連続の出現する文書レコードの文字数の総和を全文書レ
コードの文字数の総和で除した値」として算定する３文
字連続出現頻度算定手段、１０７は文字出現頻度算定手
段１０４の算定結果を受け取って、出現の度合が予め定
められた「絞り込み率」以下である複数の文字をグルー
プ化し、グループに属するいずれかの文字が出現する度
合が絞り込み率を越えない範囲で絞り込み率に最も近く
なるように調整する文字グループ化手段、１０８は２文
字連続出現頻度算定手段１０５の算定結果を受け取っ
て、出現の度合が絞り込み率以下である複数の２文字連
続をグループ化し、グループに属するいずれかの２文字
連続が出現する度合が絞り込み率を越えない範囲で絞り
込み率に最も近くなるように調整する２文字連続グルー
プ化手段、１０９は３文字連続出現頻度算定手段１０６
の算定結果を受け取って、出現の度合が絞り込み率以下
である複数の３文字連続がある場合には、これをグルー
プ化し、グループに属するいずれかの３文字連続が出現
する度合が絞り込み率を越えない範囲で絞り込み率に最
も近くなるように調整し、出現の度合が絞り込み率より
も高い３文字連続はそれ１つだけで１グループにする３
文字連続グループ化手段、１１０は文字グループ化手段
１０７と２文字連続グループ化手段１０８と３文字連続
グループ化手段１０９の出力であるグループ化情報を受
け取って、各グループに通し番号を付与し、各グループ
の通し番号と、所属文字あるいは２文字連続あるいは３
文字連続との対応表を出力する索引型式出力手段、１１
１は索引型式出力手段１１０の出力する索引型式データ
である。(Embodiment 1) An apparatus for carrying out the index creating method of the present invention will be described below with reference to the drawings. Figure 1
FIG. 1 is a block diagram showing a configuration of an index creating device according to a first embodiment of the present invention. In FIG. 1, reference numeral 101 is sample document data in which a plurality of document records forming the document data are stored. The sample document data may be all or a part of the search target document data, or may be other document data having similar statistical properties regarding the appearance of characters and character strings to the search target document data. 102
Is the sample document phrase segmentation data recording the position of each document record in the sample document data 101, and 103 is the segment of the designated document record from the sample document data 101 according to the position information of the sample document segmentation data 102. A document delimiter means 104 is provided with a special character <START> indicating the end of the document record and a special character <END> indicating the end of the record is added at the end of the document record.
The degree of appearance of each character that appears in the sample document data 101 after receiving the document record character string that is the output of No. 3 is "the total number of characters of the document records in which the character appears is divided by the total number of characters of all document records". The character appearance frequency calculation means 105 calculates the document appearance frequency of the document record character string output from the document delimiter means 103 and the calculation result of the character appearance frequency calculation means 104. A two-character continuous appearance frequency calculating means for calculating the degree of appearance of two consecutive characters appearing as "a value obtained by dividing the sum of the number of characters of the document record in which the two consecutive characters appear by the sum of the number of characters of all document records", 1
06 receives the document record character string output from the document delimiter 103 and the calculation result of the two-character consecutive appearance frequency calculator 105, and the degree of appearance of three consecutive characters that frequently appear in the sample document data 101. Is calculated as "a value obtained by dividing the sum of the numbers of characters of the document records in which three consecutive characters appear by the sum of the numbers of characters of all document records", 107 is a calculation of the character appearance frequency calculating unit 104. When the result is received, a plurality of characters whose degree of appearance is less than or equal to a predetermined “narrowing rate” are grouped, and the degree of occurrence of any character belonging to the group does not exceed the narrowing rate. The character grouping means 108 for adjusting so as to be closer to each other receives the calculation result of the two-character consecutive appearance frequency calculating means 105, and narrows down the degree of appearance. A group of two consecutive characters that is less than or equal to the refinement rate, and a two-character consecutive grouping that adjusts so that the degree of appearance of any two consecutive letters belonging to the group is closest to the refinement rate within the range that does not exceed the refinement rate. Means, 109 is a three-character consecutive appearance frequency calculation means 106
When there is a plurality of consecutive 3 characters whose degree of appearance is less than or equal to the narrowing rate, the degree of appearance is grouped, and the degree of appearance of any 3 consecutive characters belonging to the group exceeds the narrowing rate. Adjust it so that it is closest to the narrowing rate within the range that does not exist, and if there are three consecutive characters with a higher degree of appearance than the narrowing rate, group one by one 3
Character consecutive grouping means 110 receives grouping information output from the character grouping means 107, two character consecutive grouping means 108 and three character consecutive grouping means 109, assigns serial numbers to each group, and Serial number and belonging character or 2 consecutive characters or 3
Index type output means for outputting a correspondence table with character continuation, 11
Reference numeral 1 is index type data output by the index type output means 110.

【００１９】サンプル文書データ１０１には、図４に示
すような、書籍のＩＳＢＮ番号が１番号１文書レコード
として、５５３９６６レコード分記録されており、サン
プル文書区切りデータ１０２には、図４に示した文書デ
ータの各文書レコードの先頭の文字の、サンプル文書デ
ータ１０１先頭からの文字単位での隔たりが記録されて
いるものとし、絞り込み率として、０．０５を指定する
ものとする。In the sample document data 101, as shown in FIG. 4, the ISBN number of the book is recorded as 1 number 1 document record for 553966 records, and in the sample document delimiter data 102, it is shown in FIG. It is assumed that the character at the beginning of each document record of the document data is recorded in character units from the beginning of the sample document data 101, and 0.05 is specified as the narrowing rate.

【００２０】以上のように構成された索引作成装置につ
いて、その動作を説明する。まず、サンプル文書データ
１０１中の各文書レコードが、文書区切り手段１０３で
切り出されて、文字出現頻度算定手段１０４に送られ、
各文字の出現の度合が、当該文字の出現する文書レコー
ドの文字数の総和／全文書レコードの文字数の総和によ
って算定される。図５〜図２３は、この索引作成装置の
レポート出力であり、図５の「文字頻度表（Ｃｈａｒａ
ｃｔｅｒｈｉｓｔｇｒａｍ＆Ｍｏｎｏｇｒａｍ−
ｉｎｄｅｘ）」、図６〜図７の「高頻度２文字連連続
表」（ＦｒｅｑｕｅｎｔＤｉｇｒａｍ）、図８の「２
文字連続のグループ化結果」（Ｄｉｇｒａｍｐａｒｔ
ｉｔｉｏｎｓ）、図９〜図１８の「高頻度３文字連続
表」（ＴｒｉｇｒａｍＴａｂｌｅ）、図１９〜図２２
の「３文字連続のグループ化結果」（Ｔｒｉｇｒａｍ
ＰａｒｔｉｔｉｏｎｉｎｇＴａｂｌｅ）、図２３の
「索引の大きさ」（Ｉｎｄｅｘｓｉｚｅ）の各情報
が、順に記録されている。The operation of the index creating device configured as described above will be described. First, each document record in the sample document data 101 is cut out by the document delimiter 103 and sent to the character appearance frequency calculator 104,
The degree of appearance of each character is calculated by the sum of the number of characters in the document record in which the character appears / the sum of the number of characters in all the document records. FIG. 5 to FIG. 23 are report outputs of the index creating device, which are shown in FIG.
cter histgram & Monogram-
index) "," high-frequency two-character continuous table "of FIGS. 6 to 7 (Frequent Digram), and" 2 "of FIG.
Grouping result of consecutive characters "(Digram part
19), "High-frequency three-character continuous table" (Trigram Table) in FIGS.
"Results of grouping three consecutive characters" (Trigram
The Partitioning Table) and each information of the “index size” (Index size) of FIG. 23 are recorded in order.

【００２１】文字出現頻度算定手段１０４の算定結果で
ある図５において、「Ｏｃｃｕｒｅｎｃｅ」項目は、着
目文字の出現する文書レコードの文字数の総和を表し、
「（Ｐｅｒｃｅｎｔ）」項目は、着目文字の出現する文
書レコードの文字数の総和を全文書レコードの文字数の
総和で除した値に１００を乗じた値を表し、「Ｒａｎ
ｋ」項目は、着目文字の出現度合による順序を表し、
「Ｃｈａｒ」項目は、着目文字を表し、「Ｍｏｎｏｇｒ
ａｍ−ｉｎｄｅｘ」項目は、出現度合によってグループ
化された文字グループ番号を表す。例えば、「Ｃｈａ
ｒ」項目の文字「−」は、第４番目に高頻度の文字であ
り、出現文書レコードの文字数の総和は、３６９９２４
１文字であり、８７．２９％の出現頻度を持つ。なお、
このサンプル文書データの場合には、文字種が１６種と
少ないため、２文字以上からなる文字グループは存在し
ない。In FIG. 5, which is the calculation result of the character appearance frequency calculating means 104, the item "Occurence" represents the total number of characters in the document record in which the target character appears,
The “(Percent)” item represents a value obtained by multiplying 100 by a value obtained by dividing the total number of characters of the document records in which the target character appears by the total number of characters of all document records, and
The item “k” represents the order of appearance of the target character,
The "Char" item represents the character of interest, and "Monogr"
The “am-index” item represents a character group number grouped by the degree of appearance. For example, "Cha
The character “−” in the “r” item is the fourth most frequent character, and the total number of characters in the appearing document record is 369924.
It is one character and has an appearance frequency of 87.29%. In addition,
In the case of this sample document data, since there are as few as 16 character types, there is no character group consisting of two or more characters.

【００２２】こうして、サンプル文書データ１０１の１
回目の走査が終了したら、文書区切り手段１０３は、サ
ンプル文書データ１０１の２回目の走査を開始し、切り
出した文書レコードを２文字連続出現頻度算定手段１０
５に送る。２文字連続出現頻度算定手段１０５は、文書
レコード中の２文字連続のうちで、高頻度文字（この例
の場合には、全ての文字種が高頻度文字に当たる）同士
の連続のみを抽出し、各２文字連続の出現の度合が、
「当該２文字連続の出現する文書レコードの文字数の総
和／全文書レコードの文字数の総和」によって算定され
る。そのうちで、出現の度合が絞り込み率よりも高い、
高頻度の２文字連続を表示したのが、図６〜図７の「高
頻度２文字連続表」である。Thus, 1 of the sample document data 101
After the scanning of the second time is completed, the document dividing means 103 starts the second scanning of the sample document data 101, and the cut-out document record is calculated as the two-character consecutive appearance frequency calculating means 10.
Send to 5. The two-character continuous appearance frequency calculation means 105 extracts only the series of high-frequency characters (in this example, all character types correspond to high-frequency characters) from the two-character series in the document record, and The degree of appearance of two consecutive characters is
It is calculated by "the sum of the number of characters of the document records in which the two consecutive characters appear / the sum of the number of characters of all the document records". Among them, the degree of appearance is higher than the narrowing rate,
The high-frequency two-character continuous table shown in FIGS. 6 to 7 displays the high-frequency two-character continuous.

【００２３】図６において、「Ｎｏ．」項目は、高頻度
２文字連続の通し番号を表し、「Ｄｉｇｒａｍ−Ｃｏｄ
ｅ」項目は、２文字連続を構成する第１文字、第２文字
を文字グループ番号で表現した組を表し、「Ｄｉｇｒａ
ｍ−Ｃｈａｒａｃｔｅｒ」項目は、２文字連続を構成す
る文字列を表し、「Ｏｃｃ」項目は着目２文字連続の出
現する文書レコードの文字数の総和を表し、「（Ｐｅｒ
ｃｅｎｔ）」項目は、着目２文字連続の出現する文書レ
コードの文字数の総和を全文書レコードの文字数の総和
で除した値に１００を乗じた値を表す。例えば、通し番
号が６である「−４」という２文字連続は、出現の度合
が１７．５７％であることがわかる。高頻度文字同士か
らなる２文字連続のうち、高頻度２文字連続以外のすべ
てを、文字グループ番号で第１文字、第２文字を表現し
た際の文字列の昇順に並べ、その後で、並び順が近接す
る２文字連続を、２文字連続グループ化手段１０８がグ
ループにまとめる。グループ化の際の、グループのいず
れかの２文字連続が表れる度合の算定法は、グループ内
の各２文字連続の出現が統計的に独立であると仮定し、
以下の式から求める。In FIG. 6, the "No." item represents a serial number of two consecutive high-frequency characters, and is "Digram-Cod".
The “e” item represents a set in which the first character and the second character that form two consecutive characters are represented by a character group number, and “Digra
The “m-Character” item represents a character string that constitutes two consecutive characters, the “Occ” item represents the total number of characters of the document record in which two consecutive consecutive characters of interest appear, and “(Per
The “cent)” item represents a value obtained by multiplying 100 by a value obtained by dividing the total number of characters of the document records in which two consecutive focused characters appear by the total number of characters of all the document records. For example, it can be seen that the degree of appearance of two consecutive characters “-4” having a serial number of 6 is 17.57%. Of the two consecutive characters consisting of high-frequency characters, all but the two consecutive high-frequency characters are arranged in ascending order of the character string when the first character and the second character are represented by the character group number, and then the arrangement order. The two-character continuous grouping means 108 groups together two-character continuous characters that are adjacent to each other. The method of calculating the degree of occurrence of any two consecutive characters in a group when grouping assumes that the occurrence of each two consecutive characters in the group is statistically independent,
Calculate from the following formula.

【００２４】[0024]

【数１】ただし、Ｐはグループ内のｎ個の２文字連続のいずれか
が現れる度合であり、Ｐ_j（ｊ＝１，２，・・・ｎ）は
グループ内のｊ番目の２文字連続が現れる度合である。[Equation 1] However, P is the degree to which any one of the n two consecutive letters in the group appears, and P _j (j = 1, 2, ... N) is the degree to which the j-th consecutive two letters in the group appears. is there.

【００２５】その結果が、図８の「２文字連続のグルー
プ化結果」である。図８において、「Ｎｏ．」項目は、
２文字連続のグループ番号を表し、「Ｄｉｇｒａｍ−Ｃ
ｏｄｅ」項目はグループとグループの境界に位置する当
該グループ中で最も文字列順の大きい２文字連続の第１
文字、第２文字の文字グループ番号を表し、「Ｄｉｇｒ
ａｍ−Ｃｈａｒａｃｔｅｒ」項目は当該２文字連続を構
成する文字列を表す。例えば、２文字連続のグループ番
号７には、２文字連続「２２」および「２３」が含ま
れ、２文字連続のグループ番号８には、２文字連続「２
５」、「２６」、「２７」、「２９」が含まれる。The result is the "grouping result of two consecutive characters" in FIG. In FIG. 8, the “No.” item is
Represents a group number consisting of two consecutive characters, "Digram-C
The "ode" item is the first of two consecutive characters in the largest character string order in the group located at the boundary between the groups.
Represents the character group number of the character, the second character, "Digr
The “am-Character” item represents a character string that constitutes the consecutive two characters. For example, the group number 7 with two consecutive characters includes two consecutive characters “22” and “23”, and the group number 8 with two consecutive characters has two consecutive characters “2”.
5 ”,“ 26 ”,“ 27 ”, and“ 29 ”are included.

【００２６】こうして、サンプル文書データ１０１の２
回目の走査が終了したら、文書区切り手段１０３は、サ
ンプル文書データ１０１の３回目の走査を開始し、切り
出した文書レコードを３文字連続出現頻度算定手段１０
６に送る。３文字連続出現頻度算定手段１０６は、文書
レコード中の３文字連続のうちで、（第１文字、第２文
字）および（第２文字、第３文字）がいずれも高頻度２
文字連続である３文字連続のみを抽出し、各３文字連続
の出現の度合が、「当該３文字連続の出現する文書レコ
ードの文字数の総和／全文書レコードの文字数の総和」
によって算定され、その結果が３文字連続グループ化手
段１０９に送られ、式（１）と同様の基準によって、絞
り込み率をもとにグループ化される。その結果を表示し
たのが図９〜図１８の「高頻度３文字連続表」および図
１９〜図２２の「３文字連続のグループ化結果」であ
る。Thus, 2 of the sample document data 101
When the scanning of the third time is completed, the document dividing means 103 starts the third scanning of the sample document data 101, and the cut-out document record is calculated by the three-character consecutive appearance frequency calculating means 10.
Send to 6. The three-character continuous appearance frequency calculation means 106 has a high frequency of 2 for the (first character, second character) and (the second character, third character) among the three characters in the document record.
Only three consecutive characters, which are consecutive characters, are extracted, and the degree of appearance of each consecutive three characters is "total sum of character numbers of document records in which the consecutive three characters appear / total sum of character numbers of all document records".
And the result is sent to the three-character continuous grouping means 109, and the results are grouped based on the narrowing-down rate according to the same criteria as in formula (1). The results are displayed in the "high-frequency three-character continuous table" of FIGS. 9 to 18 and the "three-character continuous grouping result" of FIGS. 19 to 22.

【００２７】図９において、「Ｎｏ．」項目は、３文字
連続の通し番号を表し、「Ｇｒｏｕｐ」項目は、３文字
連続のグループ番号を表し、「Ｔｒｉｇｒａｍ−Ｃｏｄ
ｅ」項目は、３文字連続の第１文字、第２文字、第３文
字の文字グループ番号を表し、「Ｔｒｉｇｒａｍ−Ｃｈ
ａｒａｃｔｅｒ」項目は、当該３文字連続を構成する文
字列を表し、「Ｏｃｃ」項目は、着目３文字連続の出現
する文書レコードの文字数の総和を表し、「（Ｐｅｒｃ
ｅｎｔ）」項目は、着目３文字連続の出現する文書レコ
ードの文字数の総和を全文書レコードの文字数の総和で
除した値に１００を乗じた値を表す。In FIG. 9, the "No." item represents a serial number of three consecutive characters, the "Group" item represents a group number of three consecutive characters, and "Trigram-Cod".
The “e” item represents the character group number of the first character, the second character, and the third character of three consecutive characters, and “Trigram-Ch
The “actor” item represents a character string that constitutes the consecutive three characters, the “Occ” item represents the total number of characters in the document record in which the consecutive three characters of interest appear, and “(Perc
The “ent)” item represents a value obtained by multiplying 100 by a value obtained by dividing the total number of characters of document records in which three consecutive focused characters appear by the total number of characters of all document records.

【００２８】また、図１９において、「Ｎｏ．」項目
は、３文字連続のグループ番号を表し、「Ｔｒｉｇｒａ
ｍ−Ｃｏｄｅ」項目は、グループとグループの境界に位
置する当該グループ中で文字グループ番号で計った文字
列順の大きい２文字連続の第１文字、第２文字の文字グ
ループ番号を表し、「Ｔｒｉｇｒａｍ−Ｃｈａｒａｃｔ
ｅｒ」項目は、当該３文字連続を構成する文字列を表
す。例えば、３文字連続のグループ番号１３８には、３
文字連続「２０２」、「２０３」、「２０５」、「２０
６」、「２０７」、「２０９」、「２０４」が所属す
る。Further, in FIG. 19, the "No." item represents a group number of three consecutive characters, which is "Trigra".
The “m-Code” item represents the character group number of the first character and the second character of two consecutive characters in a large character string order measured by the character group number in the group located at the boundary between the groups. -Charact
The “er” item represents a character string that constitutes the three consecutive characters. For example, a group number 138 of three consecutive characters has 3
Character sequence "202", "203", "205", "20"
6 ”,“ 207 ”,“ 209 ”, and“ 204 ”belong.

【００２９】こうして、得られたグループ化情報が、索
引型式出力手段１１０に送られ、低頻度文字グループ、
２文字連続グループ、３文字連続グループの１つ１つに
対して、１ｂｉｔの索引情報を割り当てるような索引型
式を、索引型式データ１１１に出力する。作成される索
引の文書レコード当りの大きさを表示したものが、図２
３の「索引の大きさ」である。The grouping information thus obtained is sent to the index type output means 110, where the infrequent character group,
An index type that assigns 1-bit index information to each of the two-character continuous group and the three-character continuous group is output to the index type data 111. Figure 2 shows the size of each index record created.
3 is the "index size".

【００３０】図２３において、「１）Ｍｏｎｏｇｒａｍ
−ｉｎｄｅｘ：」項目は、低頻度文字索引の大きさを表
し、「２）Ｄｉｇｒａｍ−ｉｎｄｅｘ：」項目は、２文
字連続の索引の大きさを表し、「３）Ｔｒｉｇｒａｍ−
ｉｎｄｅｘ：」項目は、３文字連続の索引の大きさを表
し、「ＴｏｔａｌＩｎｄｅｘｓｉｚｅ：」項目は、
１）２）３）を合計した１文書レコード当りの索引のサ
イズを表す。この例では、１文書レコードあたり、合計
３２バイトの索引が作成される。In FIG. 23, "1) Monogram
The "-index:" item represents the size of the infrequent character index, the "2) Digim-index:" item represents the size of the index of two consecutive characters, and the "3) Trigram-" item.
The "index:" item represents the size of an index of three consecutive characters, and the "Total Index size:" item is
It represents the size of the index per document record, which is the sum of 1), 2) and 3). In this example, an index of 32 bytes is created for each document record.

【００３１】以上のように、本実施例によれば、サンプ
ル文書データの文字および文字列の出現の度合から、多
く出現する文字については、２文字連続情報を用いて、
より詳細な索引情報をつくり、その中でより多く出現す
る２文字連続には、３文字連続情報を用いてさらに詳細
な索引情報をつくることで、高頻度で出現する文字およ
び文字列に対して高精度な索引型式データを作成でき、
逆に、あまり出現しない文字および文字列については、
グループ化によって、索引情報の容量を縮小した索引型
式データを作成することができる。As described above, according to the present embodiment, two-character consecutive information is used for a character that appears abundantly based on the degree of appearance of characters and character strings in the sample document data.
By creating more detailed index information, and by using the three-character continuous information to create more detailed index information for the two-character consecutive occurrences that occur more frequently, for characters and character strings that appear frequently You can create highly accurate index type data,
On the other hand, for characters and character strings that rarely appear,
By grouping, index type data with a reduced amount of index information can be created.

【００３２】（実施例２）次に、本発明の第２の実施例
について、図面を参照しながら説明する。図２は本発明
の第２の実施例における索引作成装置の構成を示すブロ
ック図である。図２において、２０１は複数の文書レコ
ードを格納した検索対象文書データ、２０２は検索対象
文書データ２０１中の各文書レコードの位置を記録した
検索対象文書句切りデータ、２０３は検索対象文書句切
りデータ２０２の位置情報に従って検索対象文書データ
２０１から指定された文書レコードを切り出して、レコ
ード先頭を表す特別な文字＜ＳＴＡＲＴ＞を文書レコー
ド先頭に付与し、レコード終了を表す特別な文字＜ＥＮ
Ｄ＞を文書レコード末尾に付与した文字列を出力する文
書区切り手段、２０４は図１に示した索引作成装置によ
り作成された索引型式データ、２０５は文書区切り手段
２０３の出力である文書レコード文字列を受け取って、
検索対象文書データ２０１中に出現する各文字から始ま
る文字列の、索引作成時に用いる検索文字数が１である
か２であるか３であるかを、索引型式データ２０４に従
って決定する文字連続数算定手段、２０６は文字連続数
算定手段２０５の算定結果である文字数と、文字列およ
び索引型式データのグループの定義とを受け取って、対
応するグループ番号を算定するグループ番号算定手段、
２０７はグループ番号算定手段２０６の出力であるグル
ープ番号を受け取って、１文書レコードの索引情報を作
成して出力する索引情報蓄積出力手段、２０８は索引情
報蓄積出力手段２０７が出力する検索対象データ２０１
に関する索引データである。(Second Embodiment) Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a block diagram showing the configuration of the index creating device according to the second embodiment of the present invention. In FIG. 2, 201 is search target document data that stores a plurality of document records, 202 is search target document phrase cut data that records the position of each document record in the search target document data 201, and 203 is search target document phrase cut data. A specified document record is cut out from the search target document data 201 according to the position information of 202, a special character <START> indicating the beginning of the record is added to the beginning of the document record, and a special character <EN indicating the end of the record is added.
A document delimiter that outputs a character string with D> added to the end of the document record, 204 is index type data created by the index creating device shown in FIG. 1, and 205 is a document record character string that is an output of the document delimiter 203. Received
A character continuous number calculating means for determining, according to the index type data 204, whether the number of search characters used for index creation of the character string starting from each character appearing in the search target document data 201 is 1, 2, or 3 , 206 is a group number calculating means for receiving the number of characters which is the calculation result of the continuous character number calculating means 205 and the definition of the group of the character string and the index type data, and calculating the corresponding group number,
207 is an index information storage / output unit that receives the group number output from the group number calculation unit 206 and creates and outputs index information for one document record, and 208 is search target data 201 output from the index information storage / output unit 207.
It is index data regarding.

【００３３】以上のように構成された索引作成装置につ
いて、その動作を、図２４に示すこの索引作成装置が動
作する際に出力したレポート出力を例にして説明する。
図２４において、ＲｅｃｏｒｄＮｏ，２の「＜ＳＴＡ
ＲＴ＞４−５８７−５１１５１−Ｘ＜ＥＮＤ＞」は、検
索対象データの第２レコードの切り出し結果である。こ
の文字列が文字連続数算定手段２０５に送られると、ま
ず、各文字を文字グループ番号に直し、次に索引型式デ
ータ２０４にしたがって、文字連続数を算定する。この
例の文字列は、文字グループ番号で表現すると、「０、
２、３、９、６、１１、３、７、７、１０、５、４、
３、１０、１」となる。そして、先頭の「＜ＳＴＡＲＴ
＞，４」なる２文字が２文字連続として、文字グループ
番号の組［０−２］で切り出され、グループ番号算定手
段２０６が、これからグループ番号０を算定し、索引情
報蓄積出力手段２０７が、これを受け取って、内部のビ
ット列の０番目のものを、１６進「００００」から１６
進「０００１」に変える。先頭から２文字目の「４，
−，５」なる３文字の場合は、３文字連続として、文字
グループ番号の組［２−３−９］で切り出され、グルー
プ番号算定手段２０６が、これからグループ番号７２を
算定し、索引情報蓄積出力手段２０７が、これを受け取
って、内部のビット列の４番目のものを、１６進「００
００」から１６進「０１００」に変える。このようにし
て、着目文字を次々と移動させながら、索引情報蓄積出
力手段２０７の内部のビット列に、第２レコードの索引
情報をビット列の形で蓄積する。最後の文字の処理が終
了した場合には、蓄積したビット列を、索引データ２０
８に出力する。以上の処理を各文書レコードに対して次
々に行うことにより、最終的に、検索対象データ１０１
内の全文書レコードに関する索引情報を、索引データ２
０８に格納し、索引作成処理を終了する。The operation of the index creating device configured as described above will be described with reference to the report output output when the index creating device shown in FIG. 24 operates.
In FIG. 24, “<STA of Record No. 2”
“RT> 4-587-51115-X <END>” is the cutout result of the second record of the search target data. When this character string is sent to the character continuous number calculating means 205, first, each character is converted into a character group number, and then the number of continuous characters is calculated according to the index type data 204. The character string in this example is expressed as "0,
2, 3, 9, 6, 11, 3, 7, 7, 10, 5, 4,
It becomes 3, 10, 1 ". And at the beginning, "<START
>, 4 ”as two consecutive characters are cut out by a set of character group numbers [0-2], the group number calculation means 206 calculates the group number 0 from this, and the index information storage output means 207 calculates When this is received, the 0th bit of the internal bit string is changed from hexadecimal "0000" to 16
Change to "0001". The second character from the beginning, "4
In the case of 3 characters "-, 5", it is cut out as a set of 3 groups of character group numbers [2-3-9], and the group number calculation means 206 calculates the group number 72 from this and stores the index information. The output means 207 receives this and outputs the fourth bit of the internal bit string to the hexadecimal "00".
Change from "00" to hexadecimal "0100". In this way, the index information of the second record is accumulated in the bit string inside the index information accumulating / outputting means 207 in the form of the bit string while moving the focused character one after another. When the processing of the last character is completed, the accumulated bit string is used as the index data 20.
Output to 8. By performing the above-described processing on each document record one after another, finally the search target data 101
The index information about all the document records in the index data 2
08, and the index creation process ends.

【００３４】このように、本実施例の索引作成装置によ
れば、索引型式データ２０４が、検索対象文書データと
文字および文字列の出現の度合が類似している場合に
は、索引型式データ２０４内の統計情報を用いて、検索
対象文書データを調べることなしに、多く出現する文字
については、２文字連続情報を用いてより詳細な索引情
報を作り、その中でより多く出現する２文字連続には、
３文字連続情報を用いてさらに詳細な索引情報をつくる
ことで、高頻度で出現する文字および文字列に対して、
高精度な索引データを作成でき、逆に、あまり出現しな
い文字および文字列については、グループ化によって、
索引情報の容量を縮小した索引データを作成することが
できる。As described above, according to the index creating apparatus of this embodiment, when the index type data 204 is similar to the document data to be searched in the degree of appearance of characters and character strings, the index type data 204 is used. For the characters that appear frequently without looking up the document data to be searched using the statistical information in, the more detailed index information is created using the 2-character continuation information, and the 2 characters that appear more often in the index information are created. Has
By creating more detailed index information using three-character continuous information,
High-precision index data can be created, and conversely, for characters and character strings that rarely appear, grouping
It is possible to create index data with a reduced amount of index information.

【００３５】（実施例３）次に、本発明の第３の実施例
について、図面を参照しながら説明する。図３は本発明
の文書検索方法を用いた文書検索装置の一実施例を示す
ブロック図である。図３において、３０１は複数の文書
レコードを格納した検索対象文書データ、３０２は利用
者が検索条件を入力する検索条件入力手段、３０３は検
索対象文書データ３０１に関する索引情報を記録した図
２の索引作成装置を用いて作成した索引データ、３０４
は検索対象文書データ３０１を走査して、検索条件入力
手段３０２から入力された検索条件と照合する文書レコ
ードを出力する全文走査文字列照合手段、３０５は検索
条件入力手段３０２から入力された検索条件を、索引デ
ータ３０３が取り扱える検索条件に変形する索引照合条
件作成手段、３０６は索引データ３０３と、索引照合条
件作成手段３０５の作成した索引照合条件との照合を行
って、照合した文書レコードの情報を、全文走査文字列
照合手段３０４に通知する索引照合手段、３０７は全文
走査文字列照合手段３０４が出力する検索結果である。(Embodiment 3) Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 3 is a block diagram showing an embodiment of a document search device using the document search method of the present invention. In FIG. 3, reference numeral 301 is search target document data storing a plurality of document records, 302 is search condition input means for the user to input search conditions, and 303 is index of FIG. 2 in which index information regarding the search target document data 301 is recorded. Index data created using the creating device, 304
Is a full-text scanning character string collating unit that scans the search target document data 301 and outputs a document record that matches the search condition input from the search condition input unit 302, and 305 is a search condition input from the search condition input unit 302. Is converted into a search condition that can be handled by the index data 303. Reference numeral 306 is a collation between the index data 303 and the index collating condition created by the index collating condition creating unit 305, and information of the collated document record. To the full-text scanning character string collating means 304, and 307 is a search result output by the full-text scanning character string collating means 304.

【００３６】以上のように構成された文書検索装置につ
いて、その動作を、図２５の検索例により説明する。図
２５において、「キーワード？」の次の文字列が、利用
者が検索条件入力手段３０２を用いて入力した検索条件
で、この例では、正規表現「１１５［１−３］−Ｘ」が
入力されている。この検索条件の解釈は、「１１５１−
Ｘ」か、「１１５２−Ｘ」か、あるいは「１１５３−
Ｘ」のいずれかがレコード中に含まれる文書データを全
て求めよ、ということである。この検索条件が索引照合
条件作成手段３０５に送られると、図２５の「Ｍａｔｃ
ｈｉｎｇＶｅｃｔｏｒ」以下で示されているように、
検索照合条件がベクトルに埋め込まれたＡＮＤ／ＯＲ木
の型式で求まる。このベクトルの各要素は（位置−オフ
セット−ビット列）の情報を持つ。その解釈は、図２６
および図２７のようになる。この例では、例えば「１１
５」の３文字連続に対応する要素が（５−１０−１００
０）で、このうち、「１０」と１６進「１０００」で、
文書レコードに対応する索引情報のビット列中のビット
を特定する。このベクトル型式の検索照合条件が、索引
照合手段３０６に送られ、索引データ３０３と照合さ
れ、図２５の「Ｉｎｄｅｘｍａｔｃｈ」以下のよう
に、「ＲｅｃｏｒｄＮｏ．３（４−５８７−５１１５
１−Ｘ）」や「ＲｅｃｏｒｄＮｏ．１０３４７（４−
０９−１５１８０１−Ｘ）」などの文書レコードが照合
し、このレコードの位置情報が全文走査文字列照合手段
３０４に送られる。全文走査文字列照合手段３０４は，
この索引照合手段３０６が照合に成功した文書レコード
の位置情報と、検索対象文書データ３０１の文書情報を
もとに、必要な文書レコードを読み込み、検索条件入力
手段３０２から入力された検索条件、この例では正規表
現「１１５［１−３］−Ｘ」との文字列照合を行い、図
２５の「Ｒｅｓｕｌｔ」のような、最終的な結果を得
て、検索結果３０７に格納し、文書検索処理を終了す
る。The operation of the document retrieval apparatus configured as described above will be described with reference to the retrieval example of FIG. In FIG. 25, the character string next to “keyword?” Is the search condition input by the user using the search condition input means 302, and in this example, the regular expression “115 [1-3] -X” is input. Has been done. The interpretation of this search condition is "1151-
X "," 1152-X ", or" 1153-
It means that any of "X" should be obtained for all the document data included in the record. When this search condition is sent to the index matching condition creating means 305, "Matc" in FIG.
"hing Vector", as shown below,
The search matching condition is found by the type of AND / OR tree embedded in the vector. Each element of this vector has information of (position-offset-bit string). The interpretation is shown in FIG.
And it becomes like FIG. In this example, for example, "11
The element corresponding to three consecutive characters of "5" is (5-10-100
0), of these, "10" and hexadecimal "1000",
The bit in the bit string of the index information corresponding to the document record is specified. The search collation condition of this vector type is sent to the index collating means 306 and collated with the index data 303, and as shown in “Index match” in FIG. 25 and thereafter, “Record No. 3 (4-587-5115).
1-X) "or" Record No. 10347 (4-
09-151801-X) ”and the like, and the position information of this record is sent to the full-text scanning character string collating means 304. The full-text scanning character string collating means 304,
Based on the position information of the document record that the index matching unit 306 has successfully matched, and the document information of the search target document data 301, the necessary document record is read, and the search condition input from the search condition input unit 302, In the example, the character string matching with the regular expression “115 [1-3] -X” is performed, a final result such as “Result” in FIG. 25 is obtained, and the final result is stored in the search result 307, and the document search processing is performed. To finish.

【００３７】このように、本実施例の文書検索装置によ
れば、索引容量が小さく、正規表現などの複雑な検索条
件にも対応可能な本発明の索引作成装置を援用して作成
した索引データを用いて、従来はフルテキストスキャン
方式でしか扱えなかった複雑な検索条件の場合でも、高
速に文書検索を実行することができる。As described above, according to the document search apparatus of the present embodiment, index data created with the help of the index creation apparatus of the present invention, which has a small index capacity and can cope with complicated search conditions such as regular expressions. By using, it is possible to execute a document search at high speed even in the case of a complicated search condition that can be conventionally handled only by the full-text scan method.

【００３８】[0038]

【発明の効果】本発明は、上記各実施例から明らかなよ
うに、検索対象文書データに関する索引データを作成す
る際に、まずサンプル文書データの文字および文字列の
出現を統計的に調べて索引データを作成する際の共通情
報となる索引型式データを作成し、この索引型式データ
の型式に従って索引データを作成するようにしたので、
作成・更新時間が短く、容量が小さく、正規表現などの
複雑な文字列パターンでの近似検索も高速で行なうこと
のできる索引作成方法とその装置および作成された索引
データとフルテキストスキャンとを組み合わせた検索速
度の速い文書検索装置を実現することができる。As is apparent from the above-described embodiments, the present invention statistically examines the appearance of characters and character strings in the sample document data when creating index data for the document data to be searched, and the index is created. Since index type data that is common information when creating data is created and index data is created according to the type of this index type data,
Combining an index creation method and its device, and the created index data and full-text scan, which can create and update in a short time, has a small capacity, and can perform approximate search with complex character string patterns such as regular expressions at high speed. It is possible to realize a document search device having a high search speed.

【００３９】特に、文字および文字列の出現の度合があ
まり変わらない多数の検索対象文書がある場合や、検索
対象文書の更新がひんぱんに行われる場合などは、一旦
索引型式データを作成しておけば、きわめて高速に、小
容量の索引データが作成でき、検索条件の制約なしに、
フルテキストスキャンの高速化を図ることができ、その
効果は大きい。ちなみに本発明による索引データを用い
れば、全国紙の新聞１年分をキーワード１個で検索した
場合、検索速度を従来の２０倍程度も向上させることが
できる。In particular, when there are a large number of search target documents in which the degree of appearance of characters and character strings does not change so much, or when the search target documents are frequently updated, index type data should be created once. In this way, a very small amount of index data can be created at extremely high speed, and there are no restrictions on search conditions.
Full-text scanning can be speeded up, and its effect is great. By the way, by using the index data according to the present invention, when one year worth of newspapers of national newspapers are searched with one keyword, the search speed can be improved by about 20 times compared with the conventional case.

[Brief description of drawings]

【図１】本発明の第１の実施例における索引型式作成装
置の構成を示すブロック図FIG. 1 is a block diagram showing the configuration of an index type creating device according to a first embodiment of the present invention.

【図２】本発明の第２の実施例における索引作成装置の
構成を示すブロック図FIG. 2 is a block diagram showing a configuration of an index creating device according to a second embodiment of the present invention.

【図３】本発明の第３の実施例における文書検索装置の
構成を示すブロック図FIG. 3 is a block diagram showing a configuration of a document search device according to a third embodiment of the present invention.

【図４】第１の実施例におけるサンプル文書データの一
部を示す一覧図FIG. 4 is a list diagram showing a part of sample document data in the first embodiment.

【図５】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 5 is a list diagram showing report output related to index type creation processing according to the first embodiment.

【図６】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 6 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図７】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 7 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図８】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 8 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図９】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 9 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１０】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 10 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１１】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 11 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１２】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 12 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１３】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 13 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１４】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 14 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１５】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 15 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１６】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 16 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１７】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 17 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１８】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 18 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図１９】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 19 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図２０】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 20 is a list diagram showing report output related to index type creation processing in the first example.

【図２１】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 21 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図２２】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 22 is a list diagram showing report output related to index type creation processing in the first example.

【図２３】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 23 is a list diagram showing report output related to index type creation processing in the first embodiment.

【図２４】第２の実施例における索引作成処理に関する
レポート出力を示す一覧図FIG. 24 is a list diagram showing report output related to index creation processing in the second embodiment.

【図２５】第３の実施例における文書検索装置の検索例
を示す一覧図FIG. 25 is a list diagram showing a search example of the document search device according to the third embodiment.

【図２６】第３の実施例における索引照合条件の形式と
解釈を説明するための一覧図FIG. 26 is a list diagram for explaining the format and interpretation of index matching conditions in the third embodiment.

【図２７】第３の実施例における索引照合条件の形式と
解釈を説明するための一覧図FIG. 27 is a list diagram for explaining the format and interpretation of index matching conditions in the third embodiment.

[Explanation of symbols]

１０１サンプル文書データ１０２サンプル文書区切りデータ１０３文書区切り手段１０４文字出現頻度算定手段１０５２文字連続出現頻度算定手段１０６３文字連続出現頻度算定手段１０７文字グループ化手段１０８２文字連続グループ化手段１０９３文字連続グループ化手段１１０索引型式出力手段１１１索引型式データ２０１検索対象データ２０２検索対象文書区切りデータ２０３文書区切り手段２０４索引型式データ２０５文字連続数算定手段２０６グループ番号算定手段２０７索引情報蓄積出力手段２０８索引データ３０１検索対象データ３０２検索条件入力手段３０３索引データ３０４全文走査文字列照合手段３０５索引照合条件作成手段３０６索引照合手段３０７検索結果 101 sample document data 102 sample document delimiter data 103 document delimiter means 104 character appearance frequency calculation means 105 2 character consecutive appearance frequency calculation means 106 3 character consecutive appearance frequency calculation means 107 character grouping means 108 2 character consecutive grouping means 109 3 characters Continuous grouping means 110 Index type output means 111 Index type data 201 Search target data 202 Search target document delimiter data 203 Document delimiter means 204 Index type data 205 Character continuous number calculation means 206 Group number calculation means 207 Index information accumulation output means 208 Index Data 301 Search target data 302 Search condition input means 303 Index data 304 Full-text scanning character string matching means 305 Index matching condition creating means 306 Index matching means 307 Search results

Claims

[Claims]

1. When creating index data relating to search target document data, index type data which is common information when creating index data by statistically examining the appearance of characters and character strings in sample document data. And creating the index data according to the model of the index model data.

2. A character appearance frequency calculation means for statistically checking the degree of appearance of a certain character in the sample document data, and if the degree of appearance of the previously examined character is higher than a predetermined value, A plurality of N-character consecutive appearance frequency calculating means for statistically checking the degree of appearance of an N-character (N is a natural number of 2, 3, 4, ...) Character string including all the examined characters, and a sample document It becomes common information when the index data relating to the search target document data is created from the output of the character appearance frequency calculating means and the plurality of N character consecutive appearance frequency calculating means according to the degree of appearance of the character or the character string in the data. An index creating device having an index model output means for creating index model data.

3. A plurality of grouping means for grouping characters or character strings in the sample document data according to the degree of their appearance, and the index type output means outputs the grouping output from each of the grouping means. 3. The index creating device according to claim 2, wherein a correspondence table of the serial numbers of the respective groups and the characters or character strings to which they belong is output based on the information.

4. A character continuous number calculating means for determining the number of search characters used when creating index data related to document data to be searched according to the index type data output from the index creating device according to claim 3, and the continuous character number. From the group number calculation means for calculating the corresponding group number from the number of characters determined by the calculation means and the index type data output from the index creating device according to claim 3, and the group number output from the group number calculation means. An index creation device having index information storage output means for creating index data of each document record.

5. A search input means for inputting a search condition including a character string pattern and an AND / character combination of a character or a character string for matching the index data output from the index creating device according to claim 4 with the search condition. Index matching condition creating means for creating an OR tree, index matching means for matching the index data with the AND / OR tree created by the index matching condition creating means, and search target document data when the matching is successful A document provided with full text scanning character string collating means for collating the corresponding portion of the above with the retrieval condition including the character string pattern input from the retrieval condition input means and outputting the portion for which the collation succeeds as the final retrieval result. Search device.