JP3081093B2

JP3081093B2 - Index creation method and apparatus and document search apparatus

Info

Publication number: JP3081093B2
Application number: JP05253032A
Authority: JP
Inventors: 野祐司菅
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1993-10-08
Filing date: 1993-10-08
Publication date: 2000-08-28
Anticipated expiration: 2015-08-28
Also published as: JPH07105237A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a method and apparatus for creating an index used to search for character strings and the like within documents, in document retrieval systems and document editing systems built on electronic computers, and to a document retrieval apparatus.

[0002]

[Prior Art] In recent years, with the spread of word processors and personal computers, the growing capacity of computer storage devices, and the practical use of character recognition by computer, full-text databases that store all of the character information in documents have become common. As a result, interest has been growing in full-text database retrieval systems that store large volumes of character information and retrieve document information as needed.

[0003] In conventional document database systems, the usual approach was keyword search, in which keywords manually assigned to each document serve as the keys for retrieval. This approach had several problems: keyword assignment cannot keep pace with the growth in stored documents, keywords become obsolete over time, and retrieval omissions occur because the person who assigned the keywords and the person searching interpret them differently. Against this background, a document retrieval method called full-text search has attracted attention in recent years.

[0004] Full-text search falls broadly into two types. The first is the "full-text scan" method, which holds no auxiliary information besides the document data and scans the entire document data on every search. The second is the index method, in which index information that allows information on the characters or character strings appearing in the document data to be retrieved quickly is created automatically before searching, and this index is consulted at search time.

[0005] Of these, the full-text scan method uses no information other than the original documents. Its advantages are that it needs little storage, that documents can be searched immediately after they are updated, and that the search time stays nearly constant even for complex search conditions containing string patterns such as regular expressions or logical conditions, and even when there are many search results. However, because it scans all of the document data, its search speed has been pointed out to be slower than that of the index method.

[0006] The index method, on the other hand, is generally faster than the full-text scan method, and depending on how the index is created it has the advantage that search speed hardly depends on the volume of documents. However, several problems have been pointed out: the index information is large, creating the index takes a long time, and search speed drops when the search conditions are complex or when there are many search results.

[0007] These conventional document retrieval methods and index creation methods for full-text search, and their characteristics, are described in detail in papers such as "Access Methods for Text" (Christos Faloutsos, Computing Surveys, Vol. 17, No. 1, March 1985) and in books such as "Text Search Processor" (by Kosuke Takahashi, published by the Institute of Electronics, Information and Communication Engineers).

[0008]

[Problems to Be Solved by the Invention] However, with the conventional methods introduced in the above paper and book, search speed cannot be raised without an index. When an index is used, on the other hand, creating and updating the index takes time, the index data becomes large, and searches with complex string patterns such as regular expressions are also slow.

[0009] The present invention solves the above problems of the prior art. Its object is to provide an index creation method and apparatus whose index takes little time to create and update, is small, and allows even approximate searches with complex string patterns such as regular expressions to be carried out at high speed, together with a fast document retrieval apparatus that combines the created index data with full-text scan.

[0010]

[Means for Solving the Problems] To achieve the above object, the index creation method according to the present invention comprises a step of statistically examining the occurrence of characters and character strings in sample document data to create index type data, which serves as common information when the index data is created, and a step of creating index data for the search target document data in accordance with the format given by the index type data. In the index type data creation step, the occurrence of character strings is examined statistically as follows. For characters whose frequency is at or below a fixed value (low-frequency characters), it is decided that the index is created with one character. For characters whose frequency is at or above the fixed value (high-frequency characters), two-character sequences made up of high-frequency characters are examined. Next, for two-character sequences at or below the fixed frequency (low-frequency two-character sequences), it is decided that the index is created with two characters. For two-character sequences at or above the fixed frequency (high-frequency two-character sequences), three-character sequences formed from high-frequency two-character sequences are examined. By carrying out these operations in succession, index type data is created in which the more frequent a character string is, the longer the character sequence under which it is indexed.
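The staged decision described in this paragraph can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `occurrence_degree` and `build_index_units` and the `max_len` cut-off are hypothetical, records are plain Python strings, and the degree of occurrence is taken as the total length of the records containing a sequence divided by the total length of all records (the measure the embodiment defines later).

```python
from collections import defaultdict

def occurrence_degree(records, n, allowed=None):
    """Degree of occurrence of each n-character sequence.

    Degree = (total length of records containing the sequence)
             / (total length of all records).
    If `allowed` is given, a sequence is counted only when both its leading
    and trailing (n-1)-character parts are in `allowed`.
    """
    total = sum(len(r) for r in records)
    weight = defaultdict(int)
    for r in records:
        seen = {r[i:i + n] for i in range(len(r) - n + 1)}
        for s in seen:
            if allowed is None or (s[:-1] in allowed and s[1:] in allowed):
                weight[s] += len(r)
    return {s: w / total for s, w in weight.items()}

def build_index_units(records, threshold, max_len=3):
    """Decide which 1-, 2- and 3-character sequences become index units."""
    units = set()
    frequent = None  # high-frequency sequences found at the previous length
    for n in range(1, max_len + 1):
        degrees = occurrence_degree(records, n, frequent)
        low = {s for s, d in degrees.items() if d <= threshold}
        high = set(degrees) - low
        units |= low               # low-frequency sequences are indexed as-is
        if n == max_len:
            units |= high          # assumption: stop extending at max_len
        frequent = high
        if not frequent:
            break
    return units
```

With the toy records `["abab", "abba", "baab", "cd"]` and a threshold of 0.5, the rare characters `c` and `d` are indexed singly, while the frequent characters `a` and `b` are indexed only through longer sequences.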

[0011] The index creation apparatus according to the present invention comprises: character occurrence frequency calculation means for statistically examining the degree of occurrence of each single character in the sample document data; a plurality of N-character sequence occurrence frequency calculation means, each of which, when the degree of occurrence of the characters examined previously is higher than a predetermined value, statistically examines the degree of occurrence of character strings of N characters (N = 2, 3, 4, ...) that contain all of the previously examined characters; and index type output means for creating, from the outputs of the character occurrence frequency calculation means and the plurality of N-character sequence occurrence frequency calculation means and according to the degree of occurrence of characters or character strings in the sample document data, index type data that serves as common information when index data for the search target document data is created.

[0012] The index creation apparatus according to the present invention further comprises a plurality of grouping means for grouping the characters or character strings in the sample document data according to their degree of occurrence, and the index type output means outputs, based on the grouping information output from each grouping means, a correspondence table between the serial number of each group and the characters or character strings belonging to it.

[0013] The index creation apparatus according to the present invention further comprises: character sequence length calculation means for determining, in accordance with the index type data output from the index creation apparatus configured as above, the number of characters to look up when creating index data for the search target document data; group number calculation means for calculating the corresponding group number from the number of characters determined by the character sequence length calculation means and the index type data output from the index creation apparatus configured as above; and index information accumulation and output means for creating the index data of each document record from the group numbers output by the group number calculation means.
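One way to picture this index data creation stage is the sketch below. It assumes, purely for illustration, that the index type data is available as a dictionary `unit_to_group` mapping each indexed character or character sequence to its group number, and that the index data of a record is simply the set of group numbers found in it. The function names and this data layout are hypothetical and are not taken from the patent.

```python
def record_index(record, unit_to_group, max_len=3):
    """Return the set of group numbers occurring in one document record."""
    groups = set()
    for i in range(len(record)):
        # Decide how many characters to look up at this position:
        # try 1, 2, ... characters until a registered unit is found.
        for n in range(1, max_len + 1):
            unit = record[i:i + n]
            if len(unit) < n:      # ran past the end of the record
                break
            if unit in unit_to_group:
                groups.add(unit_to_group[unit])
                break
    return groups

def build_index(records, unit_to_group, max_len=3):
    """Index data for all records: one set of group numbers per record."""
    return [record_index(r, unit_to_group, max_len) for r in records]
```

Because only the group number is recorded, several rare units can share one index entry, which is what keeps the index small.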

[0014] The document retrieval apparatus according to the present invention comprises: search condition input means for entering a search condition containing a string pattern; index matching condition creation means for creating, from the search condition, an AND/OR tree of characters or character strings for matching against the index data output from the index creation apparatus configured as above; index matching means for matching the index data against the AND/OR tree created by the index matching condition creation means; and full-text scan string matching means which, when the match succeeds, matches the corresponding portion of the search target document data against the search condition containing the string pattern entered through the search condition input means, and outputs the portions that match as the final search result.

[0015]

[Operation] With the above configuration, the present invention creates, at high speed and without examining the statistical properties of the search target document data, small index data whose index format is statistically suited to the search target document data. For low-frequency characters, whose degree of occurrence in the sample document is at or below a predetermined value, namely the narrowing rate, the index type output means specifies an index data format that records the occurrence of single characters in the search target document data. For high-frequency characters, whose degree of occurrence in the sample document is higher than the narrowing rate, the two-character sequence occurrence frequency calculation means statistically examines the degree of occurrence in the sample document data of N-character strings made up of high-frequency characters, starting with two-character sequences consisting of two such characters. For low-frequency two-character sequences, whose degree of occurrence in the sample document data is at or below the narrowing rate, the index type output means specifies an index data format that records the occurrence of two-character sequences in the search target document data. For high-frequency two-character sequences, whose degree of occurrence in the sample document data is higher than the narrowing rate, the three-character sequence occurrence frequency calculation means statistically examines the degree of occurrence in the sample document data of three-character sequences whose first two characters and last two characters are each a high-frequency two-character sequence, and the index type output means specifies an index data format that records the occurrence of three-character sequences in the search target document data. In this way, even though the degrees of occurrence of characters and character strings differ, index data can be created that allows the search target document data to be narrowed down to the narrowing rate or less regardless of the search condition.

[0016] In addition, the character string occurrence frequency calculation means statistically examines the degree of occurrence of characters or character strings in the sample document data. The grouping means then distributes the characters or character strings in the sample document data into groups, each consisting of one or more characters or character strings, such that the degree to which at least one member of the group occurs is at or below the predetermined narrowing rate. The index type determination means determines an index data format that, whenever any character or character string belonging to a group occurs in the search target document data, records the information "one of the characters or character strings belonging to this group occurred", and creates the index type data accordingly. As a result, a small index can be created even when there are many kinds of low-frequency characters.

[0017] In the document retrieval apparatus, the index matching condition creation means creates, from the search condition containing a string pattern that the user entered through the search condition input means, an AND/OR tree of characters or character strings for matching against the index data. The index matching means matches the index data against the AND/OR tree created by the index matching condition creation means. For data on which this match succeeds, the full-text scan string matching means fully matches the corresponding portion of the search target document data against the search condition containing the string pattern entered through the search condition input means, and the portions that match become the final search result. This makes it possible to speed up searching by means of the index even for complex search conditions that previously could be handled only by the full-text scan method.
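The two-stage search described here (index filtering followed by a full scan of the surviving records) might look roughly like the following sketch. The tuple encoding of the AND/OR tree, the function names, and the use of Python's `re` module for the final scan are assumptions made for illustration. Deriving the tree automatically from a string pattern is not shown, so the tree is written by hand.

```python
import re

def tree_matches(tree, groups):
    """Evaluate an AND/OR tree against the group numbers of one record.

    A tree is ("UNIT", group_no), ("AND", [subtrees]) or ("OR", [subtrees]).
    """
    kind, arg = tree
    if kind == "UNIT":
        return arg in groups
    if kind == "AND":
        return all(tree_matches(t, groups) for t in arg)
    if kind == "OR":
        return any(tree_matches(t, groups) for t in arg)
    raise ValueError(f"unknown node type: {kind}")

def search(records, index, tree, pattern):
    """Return the numbers of records that pass the index filter and then
    really match the pattern when scanned in full."""
    regex = re.compile(pattern)
    hits = []
    for rec_no, groups in enumerate(index):
        # Stage 1: cheap index check. Stage 2: full scan of this record only.
        if tree_matches(tree, groups) and regex.search(records[rec_no]):
            hits.append(rec_no)
    return hits
```

The index check may let through records that do not actually match, because several units share one group number. The full scan in the second stage removes those false candidates.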

[0018]

[Embodiments]

(Embodiment 1) An apparatus for carrying out the index creation method of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the index creation apparatus in the first embodiment of the present invention. In FIG. 1, 101 is sample document data storing the document records that make up the document data. The sample document data may be all or part of the search target document data, or it may be other document data whose statistical properties regarding the occurrence of characters and character strings are similar to those of the search target document data. 102 is sample document delimiter data recording the position of each document record within the sample document data 101. 103 is document delimiting means, which cuts the specified document record out of the sample document data 101 according to the position information in the sample document delimiter data 102 and outputs a character string in which the special character <START>, denoting the start of a record, is prepended to the document record and the special character <END>, denoting the end of a record, is appended to it. 104 is character occurrence frequency calculation means, which receives the document record character strings output by the document delimiting means 103 and calculates the degree of occurrence of each character appearing in the sample document data 101 as "the sum of the character counts of the document records in which the character occurs, divided by the sum of the character counts of all document records". 105 is two-character sequence occurrence frequency calculation means, which receives the document record character strings output by the document delimiting means 103 and the result of the character occurrence frequency calculation means 104, and calculates the degree of occurrence of the two-character sequences that occur with high frequency in the sample document data 101 as "the sum of the character counts of the document records in which the two-character sequence occurs, divided by the sum of the character counts of all document records". 106 is three-character sequence occurrence frequency calculation means, which receives the document record character strings output by the document delimiting means 103 and the result of the two-character sequence occurrence frequency calculation means 105, and calculates the degree of occurrence of the three-character sequences that occur with high frequency in the sample document data 101 as "the sum of the character counts of the document records in which the three-character sequence occurs, divided by the sum of the character counts of all document records". 107 is character grouping means, which receives the result of the character occurrence frequency calculation means 104, groups together multiple characters whose degree of occurrence is at or below a predetermined "narrowing rate", and adjusts each group so that the degree to which any character in the group occurs is as close as possible to the narrowing rate without exceeding it. 108 is two-character sequence grouping means, which receives the result of the two-character sequence occurrence frequency calculation means 105, groups together multiple two-character sequences whose degree of occurrence is at or below the narrowing rate, and adjusts each group so that the degree to which any two-character sequence in the group occurs is as close as possible to the narrowing rate without exceeding it. 109 is three-character sequence grouping means, which receives the result of the three-character sequence occurrence frequency calculation means 106; when there are multiple three-character sequences whose degree of occurrence is at or below the narrowing rate, it groups them and adjusts each group so that the degree to which any three-character sequence in the group occurs is as close as possible to the narrowing rate without exceeding it, and it makes each three-character sequence whose degree of occurrence is higher than the narrowing rate a group by itself. 110 is index type output means, which receives the grouping information output by the character grouping means 107, the two-character sequence grouping means 108, and the three-character sequence grouping means 109, assigns a serial number to each group, and outputs a correspondence table between the serial number of each group and the characters, two-character sequences, or three-character sequences belonging to it. 111 is the index type data output by the index type output means 110.
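As a rough illustration of the document delimiting means 103 and the character occurrence frequency calculation means 104, the sketch below cuts records out of a single string using a list of start offsets, wraps each in start and end markers, and computes the degree of occurrence of each character as defined above. The control characters standing in for <START> and <END>, the function names, the assumption that records are stored back to back, and the choice to count the two markers in each record's length are all assumptions made for this example.

```python
START, END = "\u0002", "\u0003"  # stand-ins for the <START> and <END> characters

def cut_records(data, offsets):
    """Cut records out of `data`; `offsets` holds each record's start position
    in characters (assumes records are stored contiguously)."""
    bounds = list(offsets) + [len(data)]
    return [START + data[b:e] + END for b, e in zip(bounds, bounds[1:])]

def char_degrees(records):
    """Degree of occurrence of each character:
    (sum of lengths of records containing it) / (sum of lengths of all records)."""
    total = sum(len(r) for r in records)
    weight = {}
    for r in records:
        for ch in set(r):          # each record counts once per character
            weight[ch] = weight.get(ch, 0) + len(r)
    return {ch: w / total for ch, w in weight.items()}
```

Note that a record contributes its whole length once for every distinct character it contains, so the degrees of different characters do not sum to 1.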

[0019] Assume that the sample document data 101 holds book ISBN numbers such as those shown in FIG. 4, one number per document record, for 553,966 records; that the sample document delimiter data 102 records, for the first character of each document record of the document data shown in FIG. 4, its offset in characters from the beginning of the sample document data 101; and that 0.05 is specified as the narrowing rate.

[0020] The operation of the index creation apparatus configured as above is now described. First, each document record in the sample document data 101 is cut out by the document delimiting means 103 and sent to the character occurrence frequency calculation means 104, where the degree of occurrence of each character is calculated as the sum of the character counts of the document records in which the character occurs, divided by the sum of the character counts of all document records. FIGS. 5 to 23 are the report output of this index creation apparatus. They record, in order, the "character frequency table" (Character histgram & Monogram-index) of FIG. 5, the "high-frequency two-character sequence table" (Frequent Digram) of FIGS. 6 and 7, the "two-character sequence grouping result" (Digram partitions) of FIG. 8, the "high-frequency three-character sequence table" (Trigram Table) of FIGS. 9 to 18, the "three-character sequence grouping result" (Trigram Partitioning Table) of FIGS. 19 to 22, and the "index size" (Index size) of FIG. 23.

[0021] In FIG. 5, which shows the result of the character occurrence frequency calculation means 104, the "Occurence" item gives the sum of the character counts of the document records in which the character of interest occurs; the "(Percent)" item gives that sum divided by the sum of the character counts of all document records, multiplied by 100; the "Rank" item gives the rank of the character of interest by degree of occurrence; the "Char" item gives the character of interest; and the "Monogram-index" item gives the number of the character group into which the character was placed according to its degree of occurrence. For example, the character "-" in the "Char" item is the fourth most frequent character; the document records in which it occurs contain 3,699,241 characters in total, giving it an occurrence frequency of 87.29%. In this sample document data there are only 16 kinds of characters, so no character group consisting of two or more characters exists.

[0022] When the first scan of the sample document data 101 has finished in this way, the document delimiting means 103 starts the second scan of the sample document data 101 and sends the document records it cuts out to the two-character sequence occurrence frequency calculation means 105. From the two-character sequences in the document records, the two-character sequence occurrence frequency calculation means 105 extracts only sequences made up of high-frequency characters (in this example every kind of character is a high-frequency character), and the degree of occurrence of each two-character sequence is calculated as "the sum of the character counts of the document records in which the two-character sequence occurs, divided by the sum of the character counts of all document records". Of these, the high-frequency two-character sequences, whose degree of occurrence is higher than the narrowing rate, are listed in the "high-frequency two-character sequence table" of FIGS. 6 and 7.

[0023] In FIG. 6, the "No." item gives the serial number of the high-frequency two-character sequence; the "Digram-Code" item gives the pair formed by the first and second characters of the two-character sequence, each expressed as a character group number; the "Digram-Character" item gives the character string making up the two-character sequence; the "Occ" item gives the sum of the character counts of the document records in which the two-character sequence of interest occurs; and the "(Percent)" item gives that sum divided by the sum of the character counts of all document records, multiplied by 100. For example, the two-character sequence "-4", whose serial number is 6, has a degree of occurrence of 17.57%. Of the two-character sequences made up of high-frequency characters, all those that are not high-frequency two-character sequences are sorted in ascending order of the string obtained by expressing the first and second characters as character group numbers, and the two-character sequence grouping means 108 then collects sequences that are adjacent in this order into groups. During grouping, the degree to which any of the two-character sequences in a group occurs is obtained from the following equation, on the assumption that the occurrences of the individual two-character sequences in the group are statistically independent.

[0024]

[Equation 1]

$$P = 1 - \prod_{j=1}^{n} \left(1 - P_j\right) \qquad (1)$$

Here P is the degree to which any of the n two-character sequences in the group occurs, and P_j (j = 1, 2, ..., n) is the degree to which the j-th two-character sequence in the group occurs. (The equation image is missing from this text; the form shown is the one implied by the stated independence assumption.)
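The grouping step can be sketched as follows. Under the independence assumption stated above, the degree to which at least one member of a group occurs is one minus the product of the probabilities that each member does not occur. The greedy strategy used here (keep adding neighbouring sequences until the next one would push the combined degree above the narrowing rate) is only one plausible reading of "as close as possible to the narrowing rate without exceeding it". The function names and the sample degrees are invented for illustration.

```python
import math

def combined_degree(degrees):
    """Degree to which at least one member occurs, assuming independence:
    P = 1 - prod(1 - P_j)."""
    return 1.0 - math.prod(1.0 - p for p in degrees)

def group_units(sorted_units, narrowing_rate):
    """Greedily group neighbouring (unit, degree) pairs.

    `sorted_units` must already be in the desired sort order. A new group is
    started when adding the next unit would exceed the narrowing rate.
    """
    groups, current = [], []
    for unit, p in sorted_units:
        trial = [q for _, q in current] + [p]
        if current and combined_degree(trial) > narrowing_rate:
            groups.append([u for u, _ in current])
            current = []
        current.append((unit, p))
    if current:
        groups.append([u for u, _ in current])
    return groups
```

For instance, with made-up degrees `[("aa", 0.03), ("ab", 0.02), ("ba", 0.04), ("bb", 0.005)]` and a narrowing rate of 0.05, the first two sequences combine to about 0.0494 and form one group, and the remaining two form the next.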

【００２５】その結果が、図８の「２文字連続のグルー
プ化結果」である。図８において、「Ｎｏ．」項目は、
２文字連続のグループ番号を表し、「Ｄｉｇｒａｍ−Ｃ
ｏｄｅ」項目はグループとグループの境界に位置する当
該グループ中で最も文字列順の大きい２文字連続の第１
文字、第２文字の文字グループ番号を表し、「Ｄｉｇｒ
ａｍ−Ｃｈａｒａｃｔｅｒ」項目は当該２文字連続を構
成する文字列を表す。例えば、２文字連続のグループ番
号７には、２文字連続「２２」および「２３」が含ま
れ、２文字連続のグループ番号８には、２文字連続「２
５」、「２６」、「２７」、「２９」が含まれる。The result is the "grouping result of two-character sequences" in FIG. 8. In FIG. 8, the "No." item gives the group number of the two-character sequences; the "Digram-Code" item gives the character group numbers of the first and second characters of the two-character sequence that lies at the boundary between groups, that is, the one with the largest character-string order in the group; and the "Digram-Character" item gives the character string that makes up that two-character sequence. For example, group number 7 contains the two-character sequences "22" and "23", and group number 8 contains the two-character sequences "25", "26", "27", and "29".
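
Because each group is recorded only by its boundary (the largest two-character sequence it contains), the group of an arbitrary sequence can be found by a binary search over the boundaries. The following is a minimal sketch of that lookup, not the patent's implementation: the boundary table, the starting group number, the function name, and the use of plain strings instead of character group numbers are assumptions made for illustration; only the membership of groups 7 and 8 is taken from the example above.

```python
from bisect import bisect_left

# Hypothetical boundary table: the largest two-character sequence in each
# group, in ascending order. Only "23" (group 7) and "29" (group 8) follow
# the example in the text; "21" for group 6 is an illustrative assumption.
FIRST_GROUP_NO = 6
BOUNDARIES = ["21", "23", "29"]


def digram_group(digram: str) -> int:
    """Return the group number whose boundary is the first one >= digram."""
    i = bisect_left(BOUNDARIES, digram)
    if i == len(BOUNDARIES):
        raise KeyError(f"{digram!r} lies beyond the last group boundary")
    return FIRST_GROUP_NO + i


print(digram_group("22"), digram_group("23"), digram_group("27"))  # 7 7 8
```

Storing only the boundaries keeps the table at one entry per group, which fits the text's description of the "Digram-Code" item.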

【００２６】こうして、サンプル文書データ１０１の２
回目の走査が終了したら、文書区切り手段１０３は、サ
ンプル文書データ１０１の３回目の走査を開始し、切り
出した文書レコードを３文字連続出現頻度算定手段１０
６に送る。３文字連続出現頻度算定手段１０６は、文書
レコード中の３文字連続のうちで、（第１文字、第２文
字）および（第２文字、第３文字）がいずれも高頻度２
文字連続である３文字連続のみを抽出し、各３文字連続
の出現の度合が、「当該３文字連続の出現する文書レコ
ードの文字数の総和／全文書レコードの文字数の総和」
によって算定され、その結果が３文字連続グループ化手
段１０９に送られ、式（１）と同様の基準によって、絞
り込み率をもとにグループ化される。その結果を表示し
たのが図９〜図１８の「高頻度３文字連続表」および図
１９〜図２２の「３文字連続のグループ化結果」であ
る。When the second scan of the sample document data 101 has thus finished, the document separation means 103 starts a third scan of the sample document data 101 and sends each extracted document record to the three-character continuous appearance frequency calculation means 106. From the three-character sequences in a document record, the three-character continuous appearance frequency calculation means 106 extracts only those in which both (first character, second character) and (second character, third character) are high-frequency two-character sequences, and calculates the appearance degree of each such sequence as "the total number of characters of the document records in which the three-character sequence appears / the total number of characters of all document records". The result is sent to the three-character continuous grouping means 109, where the sequences are grouped on the basis of the narrowing-down ratio according to the same criterion as in Equation (1). The results are shown in the "high-frequency three-character sequence table" of FIGS. 9 to 18 and the "grouping result of three-character sequences" of FIGS. 19 to 22.
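
The extraction rule and the appearance-degree formula in this paragraph can be sketched as follows. This is an illustrative sketch only: the function name, the sample records, and the contents of the high-frequency set are assumptions, and the <START> and <END> markers are left out for brevity.

```python
from collections import defaultdict


def trigram_appearance(records, high_digrams):
    """Map each qualifying three-character sequence to its appearance degree.

    A sequence qualifies only if both of its overlapping two-character
    sequences are in high_digrams. Its degree is the summed length of the
    records containing it divided by the summed length of all records.
    """
    total_chars = sum(len(r) for r in records)
    chars_with = defaultdict(int)
    for record in records:
        seen = set()
        for i in range(len(record) - 2):
            tri = record[i:i + 3]
            if tri[:2] in high_digrams and tri[1:] in high_digrams:
                seen.add(tri)
        # Count each record's length once per distinct sequence it contains.
        for tri in seen:
            chars_with[tri] += len(record)
    return {tri: n / total_chars for tri, n in chars_with.items()}


# Hypothetical sample data, not taken from the figures.
records = ["4-587-1", "4-09-15", "ABCDE"]
degrees = trigram_appearance(records, high_digrams={"4-", "-5", "-0"})
print(degrees)
```

Note that the degree is weighted by record length rather than by record count, as the quoted formula specifies.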

【００２７】図９において、「Ｎｏ．」項目は、３文字
連続の通し番号を表し、「Ｇｒｏｕｐ」項目は、３文字
連続のグループ番号を表し、「Ｔｒｉｇｒａｍ−Ｃｏｄ
ｅ」項目は、３文字連続の第１文字、第２文字、第３文
字の文字グループ番号を表し、「Ｔｒｉｇｒａｍ−Ｃｈ
ａｒａｃｔｅｒ」項目は、当該３文字連続を構成する文
字列を表し、「Ｏｃｃ」項目は、着目３文字連続の出現
する文書レコードの文字数の総和を表し、「（Ｐｅｒｃ
ｅｎｔ）」項目は、着目３文字連続の出現する文書レコ
ードの文字数の総和を全文書レコードの文字数の総和で
除した値に１００を乗じた値を表す。In FIG. 9, the "No." item is the serial number of the three-character sequence; the "Group" item is its group number; the "Trigram-Code" item gives the character group numbers of its first, second, and third characters; the "Trigram-Character" item gives the character string that makes up the sequence; the "Occ" item is the total number of characters of the document records in which the sequence appears; and the "(Percent)" item is that total divided by the total number of characters of all document records, multiplied by 100.

【００２８】また、図１９において、「Ｎｏ．」項目
は、３文字連続のグループ番号を表し、「Ｔｒｉｇｒａ
ｍ−Ｃｏｄｅ」項目は、グループとグループの境界に位
置する当該グループ中で文字グループ番号で計った文字
列順の大きい２文字連続の第１文字、第２文字の文字グ
ループ番号を表し、「Ｔｒｉｇｒａｍ−Ｃｈａｒａｃｔ
ｅｒ」項目は、当該３文字連続を構成する文字列を表
す。例えば、３文字連続のグループ番号１３８には、３
文字連続「２０２」、「２０３」、「２０５」、「２０
６」、「２０７」、「２０９」、「２０４」が所属す
る。In FIG. 19, the "No." item is the group number of the three-character sequences; the "Trigram-Code" item gives the character group numbers of the first and second characters of the sequence that lies at the boundary between groups, that is, the one with the largest character-string order, measured by character group number, in the group; and the "Trigram-Character" item gives the character string that makes up that three-character sequence. For example, the three-character sequences "202", "203", "205", "206", "207", "209", and "204" belong to group number 138.

【００２９】こうして、得られたグループ化情報が、索
引型式出力手段１１０に送られ、低頻度文字グループ、
２文字連続グループ、３文字連続グループの１つ１つに
対して、１ｂｉｔの索引情報を割り当てるような索引型
式を、索引型式データ１１１に出力する。作成される索
引の文書レコード当りの大きさを表示したものが、図２
３の「索引の大きさ」である。The grouping information obtained in this way is sent to the index type output means 110, which outputs to the index type data 111 an index format that assigns one bit of index information to each low-frequency character group, each two-character sequence group, and each three-character sequence group. The resulting index size per document record is shown as the "index size" in FIG. 23.

【００３０】図２３において、「１）Ｍｏｎｏｇｒａｍ
−ｉｎｄｅｘ：」項目は、低頻度文字索引の大きさを表
し、「２）Ｄｉｇｒａｍ−ｉｎｄｅｘ：」項目は、２文
字連続の索引の大きさを表し、「３）Ｔｒｉｇｒａｍ−
ｉｎｄｅｘ：」項目は、３文字連続の索引の大きさを表
し、「ＴｏｔａｌＩｎｄｅｘｓｉｚｅ：」項目は、
１）２）３）を合計した１文書レコード当りの索引のサ
イズを表す。この例では、１文書レコードあたり、合計
３２バイトの索引が作成される。In FIG. 23, the "1) Monogram-index:" item is the size of the low-frequency character index, the "2) Digram-index:" item is the size of the two-character sequence index, the "3) Trigram-index:" item is the size of the three-character sequence index, and the "Total Index size:" item is the index size per document record obtained by summing 1), 2), and 3). In this example, an index of 32 bytes in total is created for each document record.
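
Since each group is given one bit, the per-record index size follows directly from the group counts. Below is a minimal sketch of that arithmetic, assuming each of the three parts is padded to whole bytes (an assumption; the text does not say how the parts are packed) and using made-up group counts chosen only so that the total matches the 32 bytes quoted above.

```python
def index_size_bytes(mono_groups, digram_groups, trigram_groups):
    """Return per-record sizes in bytes, at one bit per group."""
    def to_bytes(bits):
        return (bits + 7) // 8  # round up to whole bytes

    parts = {
        "monogram": to_bytes(mono_groups),
        "digram": to_bytes(digram_groups),
        "trigram": to_bytes(trigram_groups),
    }
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical group counts; the real breakdown is given only in FIG. 23.
sizes = index_size_bytes(mono_groups=16, digram_groups=48, trigram_groups=192)
print(sizes["total"])  # 32
```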

【００３１】以上のように、本実施例によれば、サンプ
ル文書データの文字および文字列の出現の度合から、多
く出現する文字については、２文字連続情報を用いて、
より詳細な索引情報をつくり、その中でより多く出現す
る２文字連続には、３文字連続情報を用いてさらに詳細
な索引情報をつくることで、高頻度で出現する文字およ
び文字列に対して高精度な索引型式データを作成でき、
逆に、あまり出現しない文字および文字列については、
グループ化によって、索引情報の容量を縮小した索引型
式データを作成することができる。As described above, according to this embodiment, the degrees of appearance of characters and character strings in the sample document data are used as follows. For frequently appearing characters, more detailed index information is created using two-character sequence information, and for the two-character sequences that appear most frequently among them, still more detailed index information is created using three-character sequence information. This yields index type data that is highly precise for characters and character strings that appear with high frequency, while for characters and character strings that appear rarely, grouping yields index type data with a reduced volume of index information.

【００３２】（実施例２）次に、本発明の第２の実施例
について、図面を参照しながら説明する。図２は本発明
の第２の実施例における索引作成装置の構成を示すブロ
ック図である。図２において、２０１は複数の文書レコ
ードを格納した検索対象文書データ、２０２は検索対象
文書データ２０１中の各文書レコードの位置を記録した
検索対象文書句切りデータ、２０３は検索対象文書句切
りデータ２０２の位置情報に従って検索対象文書データ
２０１から指定された文書レコードを切り出して、レコ
ード先頭を表す特別な文字＜ＳＴＡＲＴ＞を文書レコー
ド先頭に付与し、レコード終了を表す特別な文字＜ＥＮ
Ｄ＞を文書レコード末尾に付与した文字列を出力する文
書区切り手段、２０４は図１に示した索引作成装置によ
り作成された索引型式データ、２０５は文書区切り手段
２０３の出力である文書レコード文字列を受け取って、
検索対象文書データ２０１中に出現する各文字から始ま
る文字列の、索引作成時に用いる検索文字数が１である
か２であるか３であるかを、索引型式データ２０４に従
って決定する文字連続数算定手段、２０６は文字連続数
算定手段２０５の算定結果である文字数と、文字列およ
び索引型式データのグループの定義とを受け取って、対
応するグループ番号を算定するグループ番号算定手段、
２０７はグループ番号算定手段２０６の出力であるグル
ープ番号を受け取って、１文書レコードの索引情報を作
成して出力する索引情報蓄積出力手段、２０８は索引情
報蓄積出力手段２０７が出力する検索対象データ２０１
に関する索引データである。(Embodiment 2) Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a block diagram showing the configuration of the index creation device according to the second embodiment. In FIG. 2, reference numeral 201 denotes search target document data storing a plurality of document records; 202 denotes search target document separation data recording the position of each document record in the search target document data 201; 203 denotes document separation means that cuts the specified document record out of the search target document data 201 according to the position information in the search target document separation data 202 and outputs a character string in which a special character <START> marking the beginning of the record is added to the head of the document record and a special character <END> marking the end of the record is added to its tail; 204 denotes index type data created by the index creation device shown in FIG. 1; 205 denotes character continuation number calculation means that receives the document record character string output by the document separation means 203 and determines, according to the index type data 204, whether the number of search characters used at index creation time for the character string starting at each character appearing in the search target document data 201 is 1, 2, or 3; 206 denotes group number calculation means that receives the number of characters calculated by the character continuation number calculation means 205, together with the character string and the group definitions in the index type data, and calculates the corresponding group number; 207 denotes index information accumulation output means that receives the group number output by the group number calculation means 206 and creates and outputs the index information for one document record; and 208 denotes the index data for the search target document data 201 output by the index information accumulation output means 207.
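
How the character continuation number calculation means 205 might choose between 1, 2, and 3 characters can be sketched as follows. This is an illustrative reading of the rule summarized for Embodiment 1 (a low-frequency character is indexed alone, a low-frequency two-character sequence of high-frequency characters is indexed as two characters, and a high-frequency two-character sequence is extended to three), not the patent's code; the function name, the sample sets, and the handling of the end of the record are assumptions.

```python
def continuation_length(tokens, i, high_chars, high_digrams):
    """Return 1, 2 or 3: how many tokens starting at i form one index unit."""
    first = tokens[i]
    if first not in high_chars or i + 1 >= len(tokens):
        return 1  # low-frequency character, or nothing follows it
    digram = (first, tokens[i + 1])
    if digram not in high_digrams or i + 2 >= len(tokens):
        return 2  # low-frequency two-character sequence, or no third token
    return 3      # high-frequency two-character sequence: extend to three


tokens = ["<START>", "4", "-", "5", "8", "<END>"]
high_chars = {"<START>", "4", "-", "5"}      # hypothetical sample sets
high_digrams = {("4", "-")}
lengths = [continuation_length(tokens, i, high_chars, high_digrams)
           for i in range(len(tokens))]
print(lengths)  # [2, 3, 2, 2, 1, 1]
```

With these sample sets, the unit starting at <START> has two characters and the unit starting at "4" has three, which mirrors the cut-outs "<START>, 4" and "4, -, 5" in the worked example of this embodiment.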

【００３３】以上のように構成された索引作成装置につ
いて、その動作を、図２４に示すこの索引作成装置が動
作する際に出力したレポート出力を例にして説明する。
図２４において、ＲｅｃｏｒｄＮｏ，２の「＜ＳＴＡ
ＲＴ＞４−５８７−５１１５１−Ｘ＜ＥＮＤ＞」は、検
索対象データの第２レコードの切り出し結果である。こ
の文字列が文字連続数算定手段２０５に送られると、ま
ず、各文字を文字グループ番号に直し、次に索引型式デ
ータ２０４にしたがって、文字連続数を算定する。この
例の文字列は、文字グループ番号で表現すると、「０、
２、３、９、６、１１、３、７、７、１０、５、４、
３、１０、１」となる。そして、先頭の「＜ＳＴＡＲＴ
＞，４」なる２文字が２文字連続として、文字グループ
番号の組［０−２］で切り出され、グループ番号算定手
段２０６が、これからグループ番号０を算定し、索引情
報蓄積出力手段２０７が、これを受け取って、内部のビ
ット列の０番目のものを、１６進「００００」から１６
進「０００１」に変える。先頭から２文字目の「４，
−，５」なる３文字の場合は、３文字連続として、文字
グループ番号の組［２−３−９］で切り出され、グルー
プ番号算定手段２０６が、これからグループ番号７２を
算定し、索引情報蓄積出力手段２０７が、これを受け取
って、内部のビット列の４番目のものを、１６進「００
００」から１６進「０１００」に変える。このようにし
て、着目文字を次々と移動させながら、索引情報蓄積出
力手段２０７の内部のビット列に、第２レコードの索引
情報をビット列の形で蓄積する。最後の文字の処理が終
了した場合には、蓄積したビット列を、索引データ２０
８に出力する。以上の処理を各文書レコードに対して次
々に行うことにより、最終的に、検索対象データ１０１
内の全文書レコードに関する索引情報を、索引データ２
０８に格納し、索引作成処理を終了する。The operation of the index creation device configured as described above will be explained using, as an example, the report output produced by the device during operation, shown in FIG. 24. In FIG. 24, "<START>4-587-51151-X<END>" under Record No. 2 is the result of cutting out the second record of the search target data. When this character string is sent to the character continuation number calculation means 205, each character is first converted to its character group number, and the character continuation number is then calculated according to the index type data 204. Expressed in character group numbers, the character string in this example is "0, 2, 3, 9, 6, 11, 3, 7, 7, 10, 5, 4, 3, 10, 1". The leading two characters "<START>, 4" are cut out as a two-character sequence with the character group number pair [0-2]; the group number calculation means 206 calculates group number 0 from it, and the index information accumulation output means 207, on receiving this, changes the 0th element of its internal bit string from hexadecimal "0000" to hexadecimal "0001". The three characters "4, -, 5" starting at the second character from the head are cut out as a three-character sequence with the character group number triple [2-3-9]; the group number calculation means 206 calculates group number 72 from it, and the index information accumulation output means 207, on receiving this, changes the 4th element of its internal bit string from hexadecimal "0000" to hexadecimal "0100". In this way, while the character of interest is moved along one position at a time, the index information for the second record is accumulated as a bit string inside the index information accumulation output means 207. When the last character has been processed, the accumulated bit string is output to the index data 208. By carrying out the above processing on each document record in turn, index information for all document records in the search target document data 201 is finally stored in the index data 208, and the index creation process ends.
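
The two updates in this example are consistent with a bit string held as an array of 16-bit words, in which group number g sets bit g mod 16 of word g div 16: group 0 gives word 0 = 0x0001 and group 72 gives word 4 = 0x0100. The sketch below encodes that reading. The 16-bit word width and the bit order are inferred from the two hexadecimal values quoted above rather than stated in the text, and the function name and the 16-word size (256 bits, matching a 32-byte index) are assumptions.

```python
WORD_BITS = 16  # inferred from the four-digit hexadecimal values in the text


def set_group_bit(words, group_no):
    """Set, in place, the bit that represents group_no."""
    word_index, bit = divmod(group_no, WORD_BITS)
    words[word_index] |= 1 << bit


words = [0] * 16            # 16 words x 16 bits = 256 bits = 32 bytes
for group_no in (0, 72):    # the two group numbers from the example
    set_group_bit(words, group_no)

print(f"{words[0]:04X} {words[4]:04X}")  # 0001 0100
```

Because setting a bit is idempotent, a group that occurs several times in a record leaves the same index as one that occurs once.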

【００３４】このように、本実施例の索引作成装置によ
れば、索引型式データ２０４が、検索対象文書データと
文字および文字列の出現の度合が類似している場合に
は、索引型式データ２０４内の統計情報を用いて、検索
対象文書データを調べることなしに、多く出現する文字
については、２文字連続情報を用いてより詳細な索引情
報を作り、その中でより多く出現する２文字連続には、
３文字連続情報を用いてさらに詳細な索引情報をつくる
ことで、高頻度で出現する文字および文字列に対して、
高精度な索引データを作成でき、逆に、あまり出現しな
い文字および文字列については、グループ化によって、
索引情報の容量を縮小した索引データを作成することが
できる。Thus, according to the index creation device of this embodiment, when the index type data 204 was built from data whose degrees of appearance of characters and character strings resemble those of the search target document data, the statistical information in the index type data 204 can be used without examining the search target document data. For frequently appearing characters, more detailed index information is created using two-character sequence information, and for the two-character sequences that appear most frequently among them, still more detailed index information is created using three-character sequence information. Highly precise index data can therefore be created for characters and character strings that appear with high frequency, while for characters and character strings that appear rarely, grouping yields index data with a reduced volume of index information.

【００３５】（実施例３）次に、本発明の第３の実施例
について、図面を参照しながら説明する。図３は本発明
の文書検索方法を用いた文書検索装置の一実施例を示す
ブロック図である。図３において、３０１は複数の文書
レコードを格納した検索対象文書データ、３０２は利用
者が検索条件を入力する検索条件入力手段、３０３は検
索対象文書データ３０１に関する索引情報を記録した図
２の索引作成装置を用いて作成した索引データ、３０４
は検索対象文書データ３０１を走査して、検索条件入力
手段３０２から入力された検索条件と照合する文書レコ
ードを出力する全文走査文字列照合手段、３０５は検索
条件入力手段３０２から入力された検索条件を、索引デ
ータ３０３が取り扱える検索条件に変形する索引照合条
件作成手段、３０６は索引データ３０３と、索引照合条
件作成手段３０５の作成した索引照合条件との照合を行
って、照合した文書レコードの情報を、全文走査文字列
照合手段３０４に通知する索引照合手段、３０７は全文
走査文字列照合手段３０４が出力する検索結果である。(Embodiment 3) Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 3 is a block diagram showing one embodiment of a document search device using the document search method of the present invention. In FIG. 3, reference numeral 301 denotes search target document data storing a plurality of document records; 302 denotes search condition input means with which a user enters a search condition; 303 denotes index data, created using the index creation device of FIG. 2, that records index information about the search target document data 301; 304 denotes full-text scanning character string matching means that scans the search target document data 301 and outputs the document records that match the search condition entered through the search condition input means 302; 305 denotes index matching condition creation means that transforms the search condition entered through the search condition input means 302 into a search condition the index data 303 can handle; 306 denotes index matching means that matches the index data 303 against the index matching condition created by the index matching condition creation means 305 and notifies the full-text scanning character string matching means 304 of the matched document records; and 307 denotes the search result output by the full-text scanning character string matching means 304.

【００３６】以上のように構成された文書検索装置につ
いて、その動作を、図２５の検索例により説明する。図
２５において、「キーワード？」の次の文字列が、利用
者が検索条件入力手段３０２を用いて入力した検索条件
で、この例では、正規表現「１１５［１−３］−Ｘ」が
入力されている。この検索条件の解釈は、「１１５１−
Ｘ」か、「１１５２−Ｘ」か、あるいは「１１５３−
Ｘ」のいずれかがレコード中に含まれる文書データを全
て求めよ、ということである。この検索条件が索引照合
条件作成手段３０５に送られると、図２５の「Ｍａｔｃ
ｈｉｎｇＶｅｃｔｏｒ」以下で示されているように、
検索照合条件がベクトルに埋め込まれたＡＮＤ／ＯＲ木
の型式で求まる。このベクトルの各要素は（位置−オフ
セット−ビット列）の情報を持つ。その解釈は、図２６
および図２７のようになる。この例では、例えば「１１
５」の３文字連続に対応する要素が（５−１０−１００
０）で、このうち、「１０」と１６進「１０００」で、
文書レコードに対応する索引情報のビット列中のビット
を特定する。このベクトル型式の検索照合条件が、索引
照合手段３０６に送られ、索引データ３０３と照合さ
れ、図２５の「Ｉｎｄｅｘｍａｔｃｈ」以下のよう
に、「ＲｅｃｏｒｄＮｏ．３（４−５８７−５１１５
１−Ｘ）」や「ＲｅｃｏｒｄＮｏ．１０３４７（４−
０９−１５１８０１−Ｘ）」などの文書レコードが照合
し、このレコードの位置情報が全文走査文字列照合手段
３０４に送られる。全文走査文字列照合手段３０４は，
この索引照合手段３０６が照合に成功した文書レコード
の位置情報と、検索対象文書データ３０１の文書情報を
もとに、必要な文書レコードを読み込み、検索条件入力
手段３０２から入力された検索条件、この例では正規表
現「１１５［１−３］−Ｘ」との文字列照合を行い、図
２５の「Ｒｅｓｕｌｔ」のような、最終的な結果を得
て、検索結果３０７に格納し、文書検索処理を終了す
る。The operation of the document search device configured as described above will be explained using the search example in FIG. 25. In FIG. 25, the character string following "Keyword?" is the search condition entered by the user through the search condition input means 302; in this example, the regular expression "115[1-3]-X" has been entered. This search condition means: find all document data whose record contains "1151-X", "1152-X", or "1153-X". When the search condition is sent to the index matching condition creation means 305, the search matching condition is obtained in the form of an AND/OR tree embedded in a vector, as shown under "Matching Vector" in FIG. 25. Each element of this vector carries (position-offset-bit string) information, which is interpreted as shown in FIGS. 26 and 27. In this example, the element corresponding to the three-character sequence "115" is (5-10-1000); of this, "10" and hexadecimal "1000" identify the bit within the bit string of index information corresponding to a document record. This vector-format search matching condition is sent to the index matching means 306 and matched against the index data 303. As shown under "Index match" in FIG. 25, document records such as "Record No. 3 (4-587-51151-X)" and "Record No. 10347 (4-09-151801-X)" match, and the position information of these records is sent to the full-text scanning character string matching means 304. Using the position information of the document records that the index matching means 306 matched successfully and the document information in the search target document data 301, the full-text scanning character string matching means 304 reads the necessary document records and performs character string matching against the search condition entered through the search condition input means 302, in this example the regular expression "115[1-3]-X". It thereby obtains the final result, shown under "Result" in FIG. 25, stores it in the search result 307, and ends the document search process.
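
The two-stage flow described above, an index pre-filter followed by a full-text check of the surviving records, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the device's actual data format: the per-record index is modelled as a set of group numbers instead of a bit string, the AND/OR tree is written as nested tuples, the function names are invented, and every group number is hypothetical. Record 10347 is included to show that the index may pass a record that the full-text check then rejects.

```python
import re


def eval_tree(node, groups):
    """Evaluate an AND/OR tree of group numbers against one record's index."""
    if isinstance(node, int):
        return node in groups
    op, *children = node
    results = (eval_tree(child, groups) for child in children)
    return all(results) if op == "AND" else any(results)


def search(records, index, tree, pattern):
    """Pre-filter by index, then confirm candidates with a regex scan."""
    regex = re.compile(pattern)
    candidates = [n for n, groups in index.items() if eval_tree(tree, groups)]
    return [n for n in candidates if regex.search(records[n])]


records = {3: "4-587-51151-X", 10347: "4-09-151801-X", 8: "4-00-000000-0"}
# Hypothetical group numbers: 172 stands for "115", 40-42 for its possible
# continuations; grouping lets unrelated strings share a bit (record 10347).
index = {3: {172, 40}, 10347: {172, 41}, 8: {5}}
tree = ("AND", 172, ("OR", 40, 41, 42))

print(search(records, index, tree, r"115[1-3]-X"))  # [3]
```

The index only needs to avoid false negatives; any false positives it lets through are removed by the regular-expression scan, which is why grouping rare strings onto shared bits is safe.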

【００３７】このように、本実施例の文書検索装置によ
れば、索引容量が小さく、正規表現などの複雑な検索条
件にも対応可能な本発明の索引作成装置を援用して作成
した索引データを用いて、従来はフルテキストスキャン
方式でしか扱えなかった複雑な検索条件の場合でも、高
速に文書検索を実行することができる。Thus, according to the document search device of this embodiment, by using index data created with the index creation device of the present invention, which has a small index volume and can also handle complex search conditions such as regular expressions, a document search can be executed at high speed even for complex search conditions that could previously be handled only by the full-text scan method.

【００３８】[0038]

【発明の効果】本発明は、上記各実施例から明らかなよ
うに、検索対象文書データに関する索引データを作成す
る際に、サンプル文書データの文字および文字列の出現
を統計的に調べて前記索引データを作成する際の共通情
報となる索引型式データを作成し、前記索引型式データ
の型式に従って検索対象文書データに関する索引データ
を作成するとともに、索引型式データ作成段階では、前
記文字列の出現を統計的に調べる動作として、一定度数
以下の文字（低頻度文字）については、１文字で索引を
作成することを決定し、一定度数以上の文字（高頻度文
字）については、高頻度文字同士の２文字連続を調べ、
次に、一定度数以下の２文字連続文字（低頻度２文字連
続）については、２文字で索引を作成することを決定
し、一定度数以上の２文字連続（高頻度２文字連続）に
ついては、高頻度２文字連続文字同士の３文字連続を調
べる動作を順次行なうことにより、高頻度な文字列ほ
ど、長い文字連続として索引を作成することを決定した
内容の索引型式データを作成するようにしたので、作成
・更新時間が短く、容量が小さく、正規表現などの複雑
な文字列パターンでの近似検索も高速で行なうことので
きる索引作成方法とその装置および作成された索引デー
タとフルテキストスキャンとを組み合わせた検索速度の
速い文書検索装置を実現することができる。As is apparent from the above embodiments, the present invention, when creating index data for search target document data, statistically examines the appearance of characters and character strings in sample document data to create index type data that serves as common information for creating the index data, and creates the index data for the search target document data according to the format of the index type data. In the index type data creation stage, the appearance of character strings is statistically examined in the following order: for characters at or below a certain frequency (low-frequency characters), it is decided to create the index with one character; for characters at or above a certain frequency (high-frequency characters), two-character sequences of high-frequency characters are examined; next, for two-character sequences at or below a certain frequency (low-frequency two-character sequences), it is decided to create the index with two characters; and for two-character sequences at or above a certain frequency (high-frequency two-character sequences), three-character sequences formed from high-frequency two-character sequences are examined. Index type data is thus created in which higher-frequency character strings are indexed as longer character sequences. As a result, it is possible to realize an index creation method and device with short creation and update times and a small index volume that also allow fast approximate search with complex character string patterns such as regular expressions, as well as a fast document search device that combines the created index data with a full-text scan.

【００３９】特に、文字および文字列の出現の度合があ
まり変わらない多数の検索対象文書がある場合や、検索
対象文書の更新がひんぱんに行われる場合などは、一旦
索引型式データを作成しておけば、きわめて高速に、小
容量の索引データが作成でき、検索条件の制約なしに、
フルテキストスキャンの高速化を図ることができ、その
効果は大きい。ちなみに本発明による索引データを用い
れば、全国紙の新聞１年分をキーワード１個で検索した
場合、検索速度を従来の２０倍程度も向上させることが
できる。In particular, when there are many search target documents whose degrees of appearance of characters and character strings differ little, or when the search target documents are updated frequently, creating the index type data once makes it possible to create small index data very quickly and to speed up full-text scanning with no restriction on search conditions, which is a significant benefit. Incidentally, with index data according to the present invention, searching one year of a national newspaper with a single keyword can be made about 20 times faster than before.

[Brief description of the drawings]

【図１】本発明の第１の実施例における索引型式作成装
置の構成を示すブロック図FIG. 1 is a block diagram showing a configuration of an index type creating apparatus according to a first embodiment of the present invention.

【図２】本発明の第２の実施例における索引作成装置の
構成を示すブロック図FIG. 2 is a block diagram illustrating a configuration of an index creation device according to a second embodiment of the present invention.

【図３】本発明の第３の実施例における文書検索装置の
構成を示すブロック図FIG. 3 is a block diagram illustrating a configuration of a document search device according to a third embodiment of the present invention.

【図４】第１の実施例におけるサンプル文書データの一
部を示す一覧図FIG. 4 is a list diagram showing a part of sample document data in the first embodiment.

【図５】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 5 is a list showing a report output relating to an index type creation process in the first embodiment.

【図６】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 6 is a list diagram showing a report output related to an index type creation process in the first embodiment.

【図７】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 7 is a list diagram showing a report output related to an index type creation process in the first embodiment.

【図８】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 8 is a list diagram showing a report output related to an index type creation process in the first embodiment.

【図９】第１の実施例における索引型式作成処理に関す
るレポート出力を示す一覧図FIG. 9 is a list diagram showing a report output related to the index type creation processing in the first embodiment.

【図１０】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 10 is a list diagram showing a report output related to an index type creation process in the first embodiment.

【図１１】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 11 is a list showing a report output relating to an index type creation process in the first embodiment.

【図１２】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 12 is a list diagram showing a report output related to an index type creation process in the first embodiment.

【図１３】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 13 is a list diagram showing a report output related to the index type creation processing in the first embodiment.

【図１４】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 14 is a list diagram showing a report output related to the index type creation processing in the first embodiment.

【図１５】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 15 is a list diagram showing a report output related to the index type creation processing in the first embodiment.

【図１６】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 16 is a list showing a report output relating to an index type creation process in the first embodiment.

【図１７】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 17 is a list showing a report output relating to the index type creation processing in the first embodiment.

【図１８】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 18 is a list diagram showing a report output related to the index type creation processing in the first embodiment.

【図１９】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 19 is a view showing a list of reports output regarding the index type creation processing in the first embodiment.

【図２０】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 20 is a list showing a report output relating to the index type creation processing in the first embodiment.

【図２１】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 21 is a list showing report output related to index type creation processing in the first embodiment.

【図２２】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 22 is a list showing a report output related to the index type creation processing in the first embodiment.

【図２３】第１の実施例における索引型式作成処理に関
するレポート出力を示す一覧図FIG. 23 is a list diagram showing a report output related to an index type creation process in the first embodiment.

【図２４】第２の実施例における索引作成処理に関する
レポート出力を示す一覧図FIG. 24 is a list showing report output related to index creation processing in the second embodiment.

【図２５】第３の実施例における文書検索装置の検索例
を示す一覧図FIG. 25 is a list diagram showing a search example of the document search device in the third embodiment.

【図２６】第３の実施例における索引照合条件の形式と
解釈を説明するための一覧図FIG. 26 is a list for explaining the format and interpretation of index matching conditions in the third embodiment;

【図２７】第３の実施例における索引照合条件の形式と
解釈を説明するための一覧図FIG. 27 is a list for explaining the format and interpretation of index matching conditions in the third embodiment;

[Explanation of symbols]

101 サンプル文書データ (sample document data)
102 サンプル文書区切りデータ (sample document separation data)
103 文書区切り手段 (document separation means)
104 文字出現頻度算定手段 (character appearance frequency calculation means)
105 ２文字連続出現頻度算定手段 (two-character continuous appearance frequency calculation means)
106 ３文字連続出現頻度算定手段 (three-character continuous appearance frequency calculation means)
107 文字グループ化手段 (character grouping means)
108 ２文字連続グループ化手段 (two-character continuous grouping means)
109 ３文字連続グループ化手段 (three-character continuous grouping means)
110 索引型式出力手段 (index type output means)
111 索引型式データ (index type data)
201 検索対象データ (search target data)
202 検索対象文書区切りデータ (search target document separation data)
203 文書区切り手段 (document separation means)
204 索引型式データ (index type data)
205 文字連続数算定手段 (character continuation number calculation means)
206 グループ番号算定手段 (group number calculation means)
207 索引情報蓄積出力手段 (index information accumulation output means)
208 索引データ (index data)
301 検索対象データ (search target data)
302 検索条件入力手段 (search condition input means)
303 索引データ (index data)
304 全文走査文字列照合手段 (full-text scanning character string matching means)
305 索引照合条件作成手段 (index matching condition creation means)
306 索引照合手段 (index matching means)
307 検索結果 (search result)

フロントページの続き (Continuation of the front page)

(56) 参考文献 (References)
特開平５−174064（ＪＰ，Ａ) / JP-A-5-174064 (JP, A)
菊池，「日本語文書用高速全文検索の一手法」，情報学基礎，Ｎｏ．25−２, 1992年５月12日，ｐ．９−16 / Kikuchi, "A technique for high-speed full-text search for Japanese documents," Informatics Basics, No. 25-2, May 12, 1992, pp. 9-16
安藤、菅野、伊藤、田村、鶴林、早川，「フルテキストデータベースシステム「検蔵君」」，ＡｄｖａｎｃｅｄＤａｔａｂａｓｅＳｙｓｔｅｍＳｙｍｐｏｓｉｕｍ’90，1990年12月５日, ｐ．17−25 / Ando, Sugano, Ito, Tamura, Tsurubayashi, Hayakawa, "Full-text database system 'Kenzo-kun'," Advanced Database System Symposium '90, December 5, 1990, pp. 17-25

(58) 調査した分野 (Fields searched) (Int. Cl.7, DB name)
G06F 17/30
G06F 17/21
ＪＩＣＳＴファイル（ＪＯＩＳ) / JICST file (JOIS)

Claims

(57) [Claims]

1. An index creation method comprising: a step of statistically examining the appearance of characters and character strings in sample document data to create index type data that serves as common information when index data is created; and a step of creating index data for search target document data according to the format of the index type data, wherein, in the index type data creation step, the appearance of character strings is statistically examined by sequentially performing the following operations: for characters at or below a certain frequency (low-frequency characters), it is decided to create the index with one character; for characters at or above a certain frequency (high-frequency characters), two-character sequences of high-frequency characters are examined; next, for two-character sequences at or below a certain frequency (low-frequency two-character sequences), it is decided to create the index with two characters; and for two-character sequences at or above a certain frequency (high-frequency two-character sequences), three-character sequences formed from high-frequency two-character sequences are examined, whereby index type data is created in which it is decided that higher-frequency character strings are indexed as longer character sequences.

2. An index creation device comprising: character appearance frequency calculation means for statistically examining the degree to which a given character appears in sample document data; a plurality of N-character continuous appearance frequency calculation means for statistically examining the degree of appearance of character strings of N characters (N being a natural number 2, 3, 4, ...) that include all of the characters whose previously examined degree of appearance is at or above a predetermined value; and index type output means for creating, from the outputs of the character appearance frequency calculation means and the plurality of N-character continuous appearance frequency calculation means and according to the degree of appearance of characters or character strings in the sample document data, index type data that serves as common information when index data for search target document data is created.

3. The index creation device according to claim 2, further comprising a plurality of grouping means for grouping characters or character strings in the sample document data according to their degree of appearance, wherein the index type output means outputs, based on the grouping information output from each of the grouping means, a correspondence table between the serial number of each group and the characters or character strings belonging to that group.

4. An index creation device comprising: character continuation number calculation means for determining, according to the index type data output from the index creation device according to claim 3, the number of search characters to be used when creating index data for search target document data; group number calculation means for calculating the corresponding group number from the number of characters determined by the character continuation number calculation means and the index type data output from the index creation device according to claim 3; and index information accumulation output means for creating index data for each document record from the group number output by the group number calculation means.

5. A document search device comprising: search condition input means for inputting a search condition including a character string pattern; index matching condition creation means for creating, from the search condition, an AND/OR tree of characters or character strings for matching against index data output from the index creation device; index matching means for matching the index data against the AND/OR tree created by the index matching condition creation means; and full-text scanning character string matching means for matching the portion of the search target document data that corresponds to the index successfully matched by the index matching means against the search condition including the character string pattern input from the search condition input means, and outputting what has been successfully matched as the final search result.