JPH07334515A

JPH07334515A - Information retrieval method and device

Info

Publication number: JPH07334515A
Application number: JP6122101A
Authority: JP
Inventors: Tomoyuki Miyashita; 朋之宮下; Katsunobu Shibata; 克信柴田
Original assignee: Nippon Steel Corp
Current assignee: Nippon Steel Corp
Priority date: 1994-06-03
Filing date: 1994-06-03
Publication date: 1995-12-22
Anticipated expiration: 2018-08-18
Also published as: JP3438947B2

Abstract

(57) [Abstract]

[Purpose] To provide an information retrieval method that can retrieve the items (documents) a searcher intends at higher speed, in fields such as document databases.

[Constitution] A search string is input (step 101) and decomposed into character n-grams (step 102). For each n-gram, an importance is computed according to its occurrence frequency and its combination of character types (steps 104 to 106). Each search target document is checked for whether it contains the n-gram (step 108), and if it does, the importance of the n-gram is added to the importance of that document (step 109).

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method and apparatus for retrieving information from a document database or the like, and in particular to an information retrieval method and apparatus capable of full-text search or search over all items.

[0002]

[Prior Art] When a user searches a database holding many items of information, such as documents, for the items he or she needs, two approaches are available. In the keyword search method, keywords are assigned to each item in advance, and the desired items are found by looking for keywords that match the search key. In the direct search method, the full text of all the information is scanned to find the items that contain the search key. The keyword search method has the following problems: (1) because keywords must be assigned in advance to every item stored in the database, the work requires considerable manual labor; (2) if keywords are assigned freely, their number becomes enormous, so they have to be managed with a thesaurus; and (3) assigning accurate keywords is difficult, so the desired item may not be reached. The direct search method has its own problems: (1) when there are many items to search or each item holds a large amount of data, the search cannot be completed in a practical amount of time; and (2) so-called fuzzy search generally cannot be performed.

[0003] To solve these problems of the conventional information retrieval methods, the present applicant proposed, in Japanese Patent Laid-Open No. 4-326164, a database search system that requires no keyword assignment and can search at high speed. This system computes autocorrelation information for each item to be searched in advance. At search time it computes the autocorrelation information of the search key, obtains for each item the match score between the search key's autocorrelation information and the item's autocorrelation information, and outputs the items in descending order of match score. A fixed-size binary matrix, for example, is used as the autocorrelation information. The system has the following advantages: (1) because the search operates on fixed-size autocorrelation information regardless of the amount of data in each item, search time is greatly reduced; (2) an item containing a string written slightly differently from the search key still receives a fairly high match score, so fuzzy search is possible; (3) the autocorrelation information can be computed automatically, so far less work is needed to build the database than with the keyword search method; and (4) an item that contains the search key verbatim receives the maximum match score, so such items are never missed.

[0004] A concrete example of the database search system described in Japanese Patent Laid-Open No. 4-326164 follows. Assume that the items to be searched are English texts and that each character is represented in ASCII code. An ASCII code normally occupies 8 bits, but the most significant bit is unused as long as only ordinary English characters appear, so only the lower 7 bits are considered. Because the 7-bit code represents each character as an integer from 0 to 127, a 128 × 128 binary matrix is used as the autocorrelation information. Every element of the matrix is initialized to "0".

[0005] First, for each character of the English text, the character n-gram of a predetermined length that starts at that character is extracted. If the English text is "This_is_a_pen." (where "_" denotes a space) and the predetermined length is three characters, the n-grams "Thi", "his", "is_", "s_a", "_a_", "a_p", "_pe", "pen", "en.", and "n." are extracted. Next, for each extracted n-gram, let c₁, c₂, and c₃ be the character codes of its first, second, and third characters; the elements (c₁, c₂) and (c₁, c₃) of the binary matrix representing the autocorrelation information are set to "1". Performing this operation for every extracted n-gram yields the autocorrelation information of the English text.
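The matrix construction described in [0005] can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the cited publication: the function name `autocorrelation_matrix` is hypothetical, characters are masked to 7 bits, and an n-gram is taken at every starting position, so the pairs from "s_i" and "_is" are also set even though the example list does not show them. Shortened n-grams at the end of the text (such as "n.") contribute only the pairs that exist.

```python
def autocorrelation_matrix(text, size=128):
    """Return a size x size binary matrix for a 7-bit ASCII text.

    For every position i, element (c1, c2) and element (c1, c3) are set
    to 1, where c1, c2, c3 are the codes of the characters at i, i+1, i+2.
    """
    matrix = [[0] * size for _ in range(size)]
    codes = [ord(ch) & 0x7F for ch in text]  # keep only the lower 7 bits
    for i, c1 in enumerate(codes):
        # codes[i + 1:i + 3] holds the 2nd and 3rd characters when present;
        # near the end of the text it is shorter or empty.
        for c_next in codes[i + 1:i + 3]:
            matrix[c1][c_next] = 1
    return matrix


m = autocorrelation_matrix("This is a pen.")
print(m[ord("T")][ord("h")], m[ord("T")][ord("i")], m[ord("T")][ord("s")])
```

Because the matrix size is fixed at 128 × 128, its storage cost does not depend on the length of the text, which is the property the cited system relies on for speed.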

[0006] At search time, the autocorrelation information of the search key is first obtained by the same procedure. Then, for each item, the binary matrix of the item's autocorrelation information is compared with the binary matrix obtained from the search key, checking whether each element that is "1" in the search key's matrix is also "1" in the item's matrix. The match score is the fraction of the elements that are "1" in the search key's matrix that are also "1" in the item's matrix. The items are then output in descending order of match score.
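As a rough illustration of the match score in [0006], the sketch below stores only the positions that hold "1" in each binary matrix, as a set of (row, column) pairs. That sparse representation and the names `pair_set`, `match_score`, and `rank` are assumptions made for brevity; the score itself is the fraction described in the text.

```python
def pair_set(text):
    """Positions that would hold 1 in the 128 x 128 binary matrix."""
    codes = [ord(ch) & 0x7F for ch in text]
    return {(c1, c2) for i, c1 in enumerate(codes) for c2 in codes[i + 1:i + 3]}


def match_score(key, item):
    """Fraction of the key's 1-elements that are also 1 for the item."""
    key_ones = pair_set(key)
    if not key_ones:
        return 0.0  # a key shorter than two characters sets no element
    return len(key_ones & pair_set(item)) / len(key_ones)


def rank(key, items):
    """Items in descending order of match score."""
    return sorted(items, key=lambda item: match_score(key, item), reverse=True)


print(match_score("pen", "This is a pen."))  # key contained verbatim
print(match_score("pin", "This is a pen."))  # slightly different spelling
```

A key contained verbatim in the item scores 1.0, while a near miss such as "pin" still receives a nonzero score, which is the fuzzy-search behavior the prior-art description claims.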

[0007] In this example the autocorrelation information is computed from n-grams. Roughly speaking, the match score reflects how many of the n-grams contained in the search key are also contained in the item. Because the n-grams are not compared literally one by one but are first converted into autocorrelation information from which the match score is computed, the search is extremely fast.

[0008]

[Problems to Be Solved by the Invention] As described above, the database search system disclosed in Japanese Patent Laid-Open No. 4-326164 made it possible to search documents at high speed without using keywords. However, the number of documents and the volume of information stored in document databases keep growing rapidly, and faster and more reliable document retrieval is required.

[0009] An object of the present invention is to provide an information retrieval method and apparatus that can retrieve what the searcher intends at higher speed.

[0010]

[Means for Solving the Problems] The information retrieval method of the present invention retrieves desired documents from a set of search target documents on the basis of an input search string, and comprises: an n-gram extraction step of extracting character n-grams of a predetermined length from the search string to form a search n-gram group; an importance determination step of obtaining, for each n-gram in the search n-gram group, an importance for that n-gram; a search step of checking, for each search target document and for each n-gram in the search n-gram group, whether the document contains the n-gram and, if it does, adding the importance corresponding to the n-gram to the importance of the document; and an output step of outputting, after the search step, search results according to the importance of each search target document.

[0011] The information retrieval apparatus of the present invention retrieves desired documents from a set of search target documents on the basis of an input search string, and comprises: input means for inputting the search string; n-gram extraction means for extracting character n-grams of a predetermined length from the search string to form a search n-gram group; importance storage means for storing an importance for each search target document; search means for determining an importance for each n-gram in the search n-gram group, checking, for each search target document and for each n-gram in the group, whether the document contains the n-gram, and, if it does, adding the importance corresponding to the n-gram to the importance of that document in the importance storage means; and output means for referring to the importance storage means and outputting search results according to the importance of each search target document.

[0012]

[Operation] When searching a document database, one usually focuses on the local structure of the documents themselves. It is hardly ever the case that one wants to search on the correlation between the content at the point of interest and the content written, say, five pages later. When the focus is on local structure, it is more efficient, as noted under "Prior Art", to decompose the search key into n-grams and search on those n-grams than to use a search key several words to several tens of words long as it is.

[0013] Consider the situation in which a searcher actually performs a search: the searcher must have chosen the search key with some meaning in mind. As for the documents being searched, the words in a document do not all carry equal weight; characteristic words that help identify the document are mixed with words that do not. In conventional search methods, particularly n-gram-based ones, every n-gram contributes equally to the search result, so the result reflects neither the meaning the searcher intended nor the characteristic content of the documents. As a result, such searches often hit documents that were not intended at all.

[0014] In the present invention, the search key (search string) is decomposed into n-grams, each n-gram is judged as to whether or not it is characteristic for the purpose of the search, and more characteristic n-grams contribute more to the search result. This makes it possible to search reliably in line with the searcher's intent. Specifically, the proportion in which each n-gram contributes to the search result is defined as its importance, and when a search target document contains the n-gram, the importance of the n-gram is added to the importance (evaluation value) of that document. The importance of an n-gram can be determined, for example, by (1) obtaining, for each n-gram used in the search, its occurrence frequency across all the documents to be searched, and assigning lower importance to n-grams that occur more frequently, or (2) determining the character types of the characters that make up each n-gram and setting the importance according to the combination of character types. Methods (1) and (2) may also be used together.

[0015] Put simply, method (1) for computing the importance rests on the observation that the more commonly something appears across the search target documents, the less it helps to identify the desired document. Examples of such commonly occurring elements are inflectional endings, "the" and auxiliary verbs in English text, and particles and auxiliary verbs in Japanese text. Method (2), on the other hand, rests on the observation that, although Japanese text mixes kanji, hiragana, numerals, and so on, a run of adjacent kanji often forms a compound word with a characteristic meaning. Languages other than Japanese can be handled similarly. In Chinese, for example, characters that can serve as particles can be distinguished from those that cannot, and character types can be defined according to this distinction. In English and similar languages, character types can be defined according to whether a character is an uppercase or lowercase letter, a digit, a Greek letter, or a symbol such as a hyphen. For English it is also possible to treat the past-tense ending "ed", the adverbial ending "ly", and the like as separate character types.

[0016] In the present invention, the specific method of computing the importance of each n-gram can be varied according to, for example, the language and field of the documents to be searched (technical literature, newspaper articles, literary works, and so on), the habits of the searcher (such as which search keys he or she often uses), and the purpose of the search (such as whether a fuzzy search is wanted), and even according to the search string itself; it can also be varied adaptively. Changing the computation method as needed makes still more accurate searches possible.

[0017]

[Embodiments] Embodiments of the present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an information retrieval apparatus according to one embodiment of the present invention, and FIG. 2 is a flowchart showing the processing procedure when information is retrieved by the method of the present invention using this apparatus.

[0018] The information retrieval apparatus 11 searches the document information stored as search target documents in the database storage unit 10. It comprises: a search string input unit 12, through which the search string specified by the searcher is input; an n-gram extraction unit 13, which extracts character n-grams of a predetermined length from the input search string; a search engine unit 14, which computes the importance of each n-gram extracted from the search string and actually searches each search target document in the database storage unit 10 on the basis of these n-grams; a parameter storage unit 15, which stores the parameters used in computing the importance of each extracted n-gram; an importance storage unit 16, which stores the importance of each search target document; and an output unit 17, which refers to the importance storage unit 16, normalizes according to the importance of each search target document, and outputs the search results. The set of n-grams extracted from the search string is called the search n-gram group.

[0019] The search engine unit 14 computes an importance for each n-gram in the search n-gram group. In this embodiment it obtains two kinds of importance for each n-gram: a first-type importance determined by the occurrence frequency of the n-gram, and a second-type importance determined by the character types that make up the n-gram. Each is zero or a positive real number, and is set so that larger values contribute more to the search result. These importances are described below.

[0020] For the first-type importance, the search engine unit 14 computes the occurrence frequency of the n-gram across all the search target documents stored in the database storage unit 10, and sets the importance so that it is larger when the occurrence frequency is lower and smaller when the occurrence frequency is higher. Here the occurrence frequency is the number of documents containing the n-gram divided by the total number of documents, a real number from 0 to 1. FIG. 2 is a graph with the occurrence frequency on the x-axis and the first-type importance on the y-axis, showing an example of a function relating the two. As the graph shows, the first-type importance is a monotonically decreasing function of the occurrence frequency. An n-gram that appears in every search target document is meaningless as a search term, so it is desirable for the first-type importance of such an n-gram to be 0.
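The first-type importance of [0020] might be sketched as below. The occurrence frequency follows the definition in the text (documents containing the n-gram divided by the total number of documents). The curve itself is an assumption: the function of FIG. 2 is not reproduced here, so the sketch uses log(1/f), which is one monotonically decreasing choice that equals 0 when the n-gram appears in every document. The names `occurrence_frequency` and `first_type_importance` and the sample documents are hypothetical.

```python
import math


def occurrence_frequency(ngram, documents):
    """Fraction of documents that contain the n-gram (0.0 to 1.0)."""
    if not documents:
        return 0.0
    return sum(ngram in doc for doc in documents) / len(documents)


def first_type_importance(frequency):
    """Assumed curve: log(1/f), decreasing in f and exactly 0 at f = 1."""
    if frequency <= 0.0:
        # The n-gram occurs in no document, so its importance is never
        # added to any document; 0.0 is returned only as a placeholder.
        return 0.0
    return math.log(1.0 / frequency)


docs = ["大阪に出張する。", "東京に出張する。", "会議に出席する。"]
for ngram in ("大阪", "出張", "する"):
    f = occurrence_frequency(ngram, docs)
    print(ngram, f, first_type_importance(f))
```

With these sample documents the rarer 「大阪」 receives a larger value than 「出張」, and 「する」, which occurs in every document, receives 0.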

[0021] When the n-gram length is two characters, the search engine unit 14 determines the second-type importance according to the combination of the character types of the first and second characters. Specifically, a calculation table 21 holding the second-type importance value for each combination of character types is provided in the parameter storage unit 15, and the importance of each n-gram is obtained by looking up this table. FIG. 3 shows an example of the contents of the calculation table 21 for the case where the search target documents are in Japanese. In the illustrated example, characters are classified as hiragana, katakana, kanji, alphanumeric, or symbol, and the second-type importance is largest when the n-gram consists only of kanji. As the combination of kanji and hiragana shows, swapping the character types of the first and second characters does not necessarily give the same second-type importance. When the first character is a kanji and the second is a hiragana, the n-gram is overwhelmingly likely to be the last character of a compound word (noun) followed by a particle, and it is therefore considered less characteristic than the reverse order. N-grams of three or more characters are handled in essentially the same way.
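The table lookup of [0021] can be illustrated with the following sketch for two-character n-grams. Everything numeric here is an assumption: the values of FIG. 3 are not reproduced in the text, so the table entries only respect the two properties the text states (kanji + kanji is largest, and kanji + hiragana is lower than hiragana + kanji). The classification by Unicode code-point range, the default value, and the names `char_type`, `SECOND_TYPE_TABLE`, and `second_type_importance` are likewise hypothetical.

```python
def char_type(ch):
    """Classify one character by Unicode range (a simplification)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "symbol"  # punctuation, full-width forms, everything else


# Hypothetical stand-in for calculation table 21; keys are
# (type of 1st character, type of 2nd character).
SECOND_TYPE_TABLE = {
    ("kanji", "kanji"): 1.0,
    ("katakana", "katakana"): 0.8,
    ("alnum", "alnum"): 0.6,
    ("hiragana", "kanji"): 0.5,
    ("kanji", "hiragana"): 0.2,
}
DEFAULT_IMPORTANCE = 0.1  # assumed value for combinations not listed


def second_type_importance(bigram):
    key = (char_type(bigram[0]), char_type(bigram[1]))
    return SECOND_TYPE_TABLE.get(key, DEFAULT_IMPORTANCE)


for bigram in ("大阪", "に出", "阪に", "る。"):
    print(bigram, second_type_importance(bigram))
```

Keeping the values in a table rather than in code mirrors the role of the parameter storage unit 15: the weights can be swapped for another language or document field without touching the lookup.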

[0022] After obtaining the first-type and second-type importances, the search engine unit 14 performs the actual search processing. One usable search algorithm is, for example, that described in the above-mentioned Japanese Patent Laid-Open No. 4-326164. The search processing determines, for each search target document and for each n-gram in the search n-gram group, whether the document contains the n-gram; if it does, the first-type and second-type importances of that n-gram are added to the importance of the document held in the importance storage unit 16. In practice, the first-type and second-type importances of all the n-grams in the search n-gram group may be computed first and the search processing then run in one batch. Alternatively, one n-gram may be taken from the group, its first-type and second-type importances obtained, and each search target document checked for that n-gram, with this sequence repeated for every n-gram.

[0023] FIG. 4 shows an example of the structure of the importance storage unit 16. The importance storage unit 16 is organized as a table that pairs the document number assigned to each search target document in the database storage unit 10 with the importance of the corresponding document. The initial value of each importance is 0. The importance of each search target document corresponds to the match score described under "Prior Art".

[0024] Next, the procedure for retrieving information with the information retrieval apparatus 11 according to the method of the present invention is described with reference to FIG. 5.

[0025] First, a search string is input through the search string input unit 12 (step 101), and the n-gram extraction unit 13 decomposes it into n-grams of the predetermined length (step 102). If the n-gram length is two characters and the search string is, for example, 「大阪に出張する。」 ("I will go to Osaka on a business trip."), the n-grams 「大阪」, 「阪に」, 「に出」, 「出張」, 「張す」, 「する」, and 「る。」 are extracted and form the search n-gram group.
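Step 102 amounts to a sliding window over the search string. A minimal Python sketch follows; the function name `extract_ngrams` is hypothetical, and only full-length n-grams are kept, which matches the seven n-grams listed for this example.

```python
def extract_ngrams(text, n=2):
    """All character n-grams of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(extract_ngrams("大阪に出張する。"))
# ['大阪', '阪に', 'に出', '出張', '張す', 'する', 'る。']
```

A string shorter than n yields an empty list, so a caller would need to decide how to treat very short search strings.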

[0026] Next, the following processing is performed for each n-gram in the search n-gram group. It is determined whether any unprocessed n-gram remains (step 103). If one does, the search engine unit 14 computes the occurrence frequency of that n-gram across all the search target documents (step 104), computes the first-type importance of the n-gram from that frequency (step 105), and determines the second-type importance of the n-gram from the combination of character types that make it up (step 106). It then checks whether the database storage unit 10 holds any search target document that has not yet been searched for this n-gram (step 107). If so, one such document is selected and it is determined whether the document contains the n-gram (step 108). If it does not, processing returns directly to step 107; if it does, the first-type and second-type importances of the n-gram are added to the importance of that document in the importance storage unit 16 (step 109), and processing returns to step 107. When step 107 finds that no unsearched document remains, processing returns to step 103 to search with the next n-gram in the search n-gram group.
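The loop of steps 103 to 109 can be sketched as follows. This is an illustration, not the patented implementation: the function name `score_documents` is hypothetical, the two importance functions are passed in as parameters so that any curve or table can be tried, and containment is tested by plain substring search rather than by the autocorrelation matrices of the cited prior art. The sketch also computes the containment flags once per n-gram and reuses them for both the frequency (step 104) and the per-document check (step 108).

```python
def score_documents(search_string, documents, first_type, second_type, n=2):
    """Return one accumulated importance per document (steps 102 to 109)."""
    if not documents:
        return []
    # Step 102: decompose the search string into n-grams.
    ngrams = [search_string[i:i + n] for i in range(len(search_string) - n + 1)]
    importance = [0.0] * len(documents)  # importance store, initially 0
    for ngram in ngrams:  # step 103: take the next unprocessed n-gram
        contains = [ngram in doc for doc in documents]
        frequency = sum(contains) / len(documents)  # step 104
        weight = first_type(frequency) + second_type(ngram)  # steps 105, 106
        for doc_no, found in enumerate(contains):  # steps 107, 108
            if found:
                importance[doc_no] += weight  # step 109
    return importance


docs = ["大阪に出張する。", "東京に出張する。", "京都で会議をする。"]
# Toy importance functions, chosen only to make the example easy to follow.
scores = score_documents("大阪に出張する。", docs,
                         first_type=lambda f: 1.0 - f,
                         second_type=lambda ngram: 0.0)
print(scores)
```

With these toy functions the third document, which shares only 「する」 and 「る。」 with the search string, ends with an importance of 0 because both n-grams occur in every document.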

【００２７】上述した「大阪に出張する。」という検索
文字列から２文字ずつの連字を抽出した場合、連字「す
る」や「る。」は各文書に共通して出現しやすいので、第
１種の重要度は小さくなる。また、図３に示すような計
算用テーブルを使用している場合には、連字「阪に」に対
する第２種の重要度は小さくなる。一方、連字「大阪」
は、漢字２文字からなるので第２種の重要度は大きくな
る。また、各検索対象文書を通じての「大阪」の出現頻度
が小さく、「出張」の出現頻度がある程度大きいものとす
れば、第１種の重要度は「大阪」の方が「出張」よりも大き
くなる。結局、全体的に見れば、「大阪」の寄与度合が大
きい検索がなされることになる。When two-character consecutive characters are extracted from the search character string "大阪に出張する。" ("I will travel to Osaka on business.") described above, the consecutive characters "する" and "る。" tend to appear in every document, so their first-type importance is small. When a calculation table such as the one shown in FIG. 3 is used, the second-type importance of the consecutive character "阪に" is also small. By contrast, the consecutive character "大阪" (Osaka) consists of two kanji, so its second-type importance is large. Furthermore, if "大阪" appears infrequently across the search target documents while "出張" (business trip) appears fairly often, the first-type importance of "大阪" is larger than that of "出張". Taken as a whole, the search is therefore one to which "大阪" contributes heavily.
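The two kinds of importance discussed for this example can be illustrated with a short Python sketch. The numeric values in the table, the decreasing function, and the Unicode ranges used to classify characters are assumptions chosen for illustration; the actual curve of FIG. 2 and the actual table of FIG. 3 are not reproduced here.

```python
def char_type(ch):
    """Classify one character by Unicode block (a simplifying assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


# Hypothetical stand-in for the calculation table of FIG. 3.
# The values are invented; only their ordering follows the text.
SECOND_IMPORTANCE_TABLE = {
    ("kanji", "kanji"): 1.0,
    ("katakana", "katakana"): 1.0,
    ("kanji", "hiragana"): 0.2,
    ("hiragana", "hiragana"): 0.1,
}
DEFAULT_SECOND_IMPORTANCE = 0.0  # assumed value for combinations not in the table


def second_importance(gram):
    """Look up the second-type importance from the character-type combination."""
    key = tuple(char_type(ch) for ch in gram)
    return SECOND_IMPORTANCE_TABLE.get(key, DEFAULT_SECOND_IMPORTANCE)


def first_importance(freq):
    """A hypothetical monotonically decreasing function of appearance frequency."""
    return 1.0 / (1.0 + freq)


for gram in ["大阪", "阪に", "する", "る。"]:
    print(gram, second_importance(gram))
```

With this table, "大阪" receives the largest second-type importance and "阪に", "する" and "る。" receive small ones, matching the qualitative behavior described above.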

【００２８】ステップ１０３で未処理の連字が残ってい
ないと判定された場合は、検索用連字群の全ての連字に
基づく検索処理が終了した場合であるから、出力部１７
に制御を移し、重要度格納部１６に格納された各文書ご
との重要度の値を正規化する（ステップ１１０）。ここ
で正規化とは、最大の重要度が１になるように、各重要
度に同一の係数を乗算する処理のことである。検索文字
列（検索キー）が長いほど検索用連字群の連字の数が多
くなり、そのため各文書ごとの生の重要度が大きくなり
がちであるが、このように正規化を行なうことにより、
このような検索文字列の相違により重要度の値のばらつ
きを補正することが可能となる。そして、正規化された
重要度を大きい方から順に並べ（ステップ１１１）、文
書ごとの重要度をリストとして出力することによって検
索結果の出力を行ない（ステップ１１２）、処理を終了
する。If it is determined in step 103 that no unprocessed consecutive character remains, the search processing based on every consecutive character in the search consecutive-character group has finished. Control therefore passes to the output unit 17, which normalizes the importance value of each document stored in the importance storage unit 16 (step 110). Normalization here means multiplying every importance by the same coefficient so that the largest importance becomes 1. The longer the search character string (search key), the more consecutive characters the search consecutive-character group contains, and so the raw importance of each document tends to be larger. Normalizing in this way corrects the variation in importance values caused by such differences between search character strings. The normalized importances are then sorted in descending order (step 111), the search result is output as a list of the importance of each document (step 112), and the processing ends.
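Steps 110 to 112 amount to a scale-and-sort operation, sketched below in Python. The function name and the list-of-pairs output format are assumptions for illustration, as is the handling of the case where every importance is zero, which the text does not discuss.

```python
def normalize_and_rank(raw_scores):
    """Scale importances so the maximum is 1, then sort in descending order.

    raw_scores maps a document id to its raw importance.
    Returns a list of (document id, normalized importance) pairs.
    """
    if not raw_scores:
        return []
    top = max(raw_scores.values())
    # Step 110: one common coefficient for all documents.
    # Assumption: if no document scored above zero, leave values unchanged.
    coefficient = 1.0 / top if top > 0 else 1.0
    normalized = [(doc_id, score * coefficient) for doc_id, score in raw_scores.items()]
    # Step 111: largest importance first.
    normalized.sort(key=lambda pair: pair[1], reverse=True)
    return normalized  # step 112 would output this list


print(normalize_and_rank({"doc2": 2.0, "doc1": 4.0, "doc3": 1.0}))
```

Because every document is multiplied by the same coefficient, the ranking is unchanged; only the scale becomes comparable between short and long search character strings.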

【００２９】[0029]

【発明の効果】以上説明したように本発明は、検索文字
列から連字を抽出した上で各連字に対する重要度を求
め、検索結果に対する各連字の寄与割合をこの重要度に
応じて変化させることにより、検索において特徴的な連
字がより検索結果に反映することになって、検索者の意
図に沿って確実に検索を行なうことが可能となるという
効果がある。連字に対する重要度の定め方として、検
索に使用される連字ごとに、検索対象の全文書を通じて
のその連字の出現頻度を求め、出現頻度の高い連字ほど
重要度を小さくする、連字ごとに連字を構成する文字
種を求め、その文字種の組み合わせによってその連字の
重要度を定める、などの方法を採用することにより、よ
り確実な検索を行なうことが可能となる。As described above, the present invention extracts consecutive characters from the search character string, obtains an importance for each consecutive character, and varies the contribution of each consecutive character to the search result according to that importance. Consecutive characters that are distinctive for the search are thereby reflected more strongly in the search result, so the search can be performed reliably in line with the searcher's intention. The search becomes more reliable still when the importance of a consecutive character is determined by methods such as obtaining, for each consecutive character used in the search, its appearance frequency across all search target documents and assigning a smaller importance to consecutive characters that appear more frequently, or obtaining the character types forming each consecutive character and determining its importance from the combination of those character types.

[Brief description of drawings]

【図１】本発明の一実施例の情報検索装置の構成を示す
ブロック図である。FIG. 1 is a block diagram showing a configuration of an information search device according to an embodiment of the present invention.

【図２】出現頻度と第１種の重要度の関係を表わす関数
の一例を示すグラフである。FIG. 2 is a graph showing an example of a function representing the relationship between the appearance frequency and the first type importance.

【図３】文字種と第２種の重要度との対応を表わす計算
用テーブルの構成の一例を示す図である。FIG. 3 is a diagram showing an example of a configuration of a calculation table showing correspondence between character types and second type importances.

【図４】重要度格納部の構成を示す図である。FIG. 4 is a diagram showing a configuration of an importance degree storage unit.

【図５】図１の情報検索装置を使用し本発明の方法によ
って情報の検索を行なう場合の処理手順を示すフローチ
ャートである。FIG. 5 is a flowchart showing the processing procedure when information is retrieved by the method of the present invention using the information retrieval device of FIG. 1.

[Explanation of symbols]

１０ データベース格納部 (database storage unit)
１１ 情報検索装置 (information retrieval device)
１２ 検索文字列入力部 (search character string input unit)
１３ 連字抽出部 (consecutive character extraction unit)
１４ 検索エンジン部 (search engine unit)
１５ パラメータ格納部 (parameter storage unit)
１６ 重要度格納部 (importance storage unit)
１７ 出力部 (output unit)
２１ 計算用テーブル (calculation table)
１０１〜１１２ ステップ (steps)

Claims

[Claims]

1. An information retrieval method for retrieving a desired document from a set of search target documents based on an input search character string, comprising: a consecutive-character extraction step of extracting consecutive characters of a predetermined character length from the search character string to form a search consecutive-character group; an importance determination step of obtaining, for each consecutive character included in the search consecutive-character group, the importance of that consecutive character; a search step of checking, for each consecutive character included in the search consecutive-character group, whether the consecutive character is contained in each search target document and, if it is contained, adding the importance corresponding to the consecutive character to the importance of that search target document; and an output step of outputting, after the search step has been performed, a search result according to the importance of each search target document.

2. The information retrieval method according to claim 1, wherein the importance determination step calculates, for each consecutive character included in the search consecutive-character group, the appearance frequency of the consecutive character across all search target documents, and determines the importance of the consecutive character so that the value is small when the appearance frequency is high and large when the appearance frequency is low.

3. The information retrieval method according to claim 1, wherein the importance determination step determines, for each consecutive character included in the search consecutive-character group, the importance of the consecutive character according to the combination of character types constituting the consecutive character.

4. An information retrieval method for retrieving a desired document from a set of search target documents based on an input search character string, comprising: a consecutive-character extraction step of extracting consecutive characters of a predetermined character length from the search character string to form a search consecutive-character group; an importance determination step of calculating, for each consecutive character included in the search consecutive-character group, the appearance frequency of the consecutive character across all search target documents, determining a first-type importance for the consecutive character so that the value is small when the appearance frequency is high and large when the appearance frequency is low, and determining a second-type importance for the consecutive character according to the combination of character types forming the consecutive character; a search step of checking, for each consecutive character included in the search consecutive-character group, whether the consecutive character is contained in each search target document and, if it is contained, adding the first-type and second-type importances corresponding to the consecutive character to the importance of that search target document; and an output step of outputting, after the search step has been performed, a search result according to the importance of each search target document.

5. An information retrieval device for retrieving a desired document from a set of search target documents based on an input search character string, comprising: input means for inputting the search character string; consecutive-character extraction means for extracting consecutive characters of a predetermined character length from the search character string to form a search consecutive-character group; importance storage means for storing the importance of each search target document; search means for determining, for each consecutive character included in the search consecutive-character group, the importance of that consecutive character, checking whether the consecutive character is contained in each search target document, and, if it is contained, adding the importance corresponding to the consecutive character to the importance of that search target document in the importance storage means; and output means for referring to the importance storage means and outputting a search result according to the importance of each search target document.

6. The information search device according to claim 5, wherein the importance of consecutive characters is determined according to a function that monotonically decreases with respect to the appearance frequency of the consecutive characters in all search target documents.

7. The degree of importance for a consecutive character is determined according to a combination of character types forming the consecutive character.
Information retrieval device described in.