JPH06208588A

JPH06208588A - Document retrieving system

Info

Publication number: JPH06208588A
Application number: JP5134072A
Authority: JP
Inventors: Yasutsugu Ogawa; 泰嗣小川; Reiko Bessho; 礼子別所
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1992-08-14
Filing date: 1993-05-12
Publication date: 1994-07-26
Anticipated expiration: 2018-10-27
Also published as: JP3460728B2

Abstract

PURPOSE: To allow a document to be regarded as relevant even when the search word is not exactly the same as a word in the document. CONSTITUTION: The user enters a search word through search word input means 1. Document score assigning means 2 morphologically analyzes the entered search word and compares the resulting words, segmented by part of speech, with the keywords stored word by word for each document, so that a document can be retrieved even when the search word does not completely match a word in it. Means 2 assigns each document a score that reflects the entered search word. Document ranking means 3 sorts the scored documents in descending order of score. Document output means 4 outputs the search result to the user.

Description

Detailed Description of the Invention

[0001]

[Technical Field] The present invention relates to a document retrieval system, and more particularly to a document retrieval system in which a document can be regarded as relevant even when the search word is not exactly the same as a word in the document. It is applicable, for example, to document management devices and image management devices.

[0002]

[Prior Art] The following known documents describe prior art related to the present invention. The "similar document retrieval apparatus" proposed in Japanese Patent Laid-Open No. H2-2458 makes it possible to retrieve a desired document even among documents that have no keywords, by automatically extracting keywords through morphological analysis or similar processing of each document. When a search word is entered, the apparatus outputs documents with a high degree of similarity to it. Even when no keywords have been assigned to a document in advance, independent words are extracted from the document, taken as keywords in descending order of frequency, and compared with the search word to determine similarity. However, a document is not regarded as relevant unless it contains exactly the same word as the search word. Moreover, in a method that extracts independent words, takes them as keywords in descending order of frequency, and compares them with the search word, a word is treated as more important simply because it appears more often, so an accurate search cannot be performed.

[0003] In addition, "Text-base retrieval method based on semantic attributes" (Hiroshi Matsuo and one other, Transactions of the Information Processing Society of Japan, Vol. 32, No. 9, Sep. 1991, pp. 1172-1179) handles similarity among varied expressions by retrieving, on the basis of the semantic attributes of words, texts whose headline sentences are semantically similar to the search instruction sentence. That is, from a database (DB) holding a large number of cards with headings, the target card is retrieved on the basis of its headline sentence. The headline sentence, rather than the whole document, is treated as the index, and partial matches between the search word and the headline sentence are also accepted. However, because only headline sentences are searched, the method has the drawback that the whole document cannot be searched.

[0004]

[Object] The present invention has been made in view of the above circumstances. Its object is to provide a document retrieval system in which a document can be regarded as relevant even when the search word is not exactly the same as a word in the document; in which an accurate search can be performed because keywords in a document are scored according to the search word; and in which the whole document, not only the headline sentence, is searched.

[0005]

[Constitution] To achieve the above object, the present invention is characterized by the following.

(1) The system comprises morphological analysis means for morphologically analyzing an entered search word, and comparison means for comparing the words obtained by the morphological analysis means, segmented by part of speech, with keywords stored word by word for each document, so that retrieval is possible even when the search word and a word in the document do not match completely.
(2) Each document is given a score reflecting the search word by calculating the degree of coincidence between the search word and the keywords in each document.
(3) In (2), documents are scored according to the search word so that they can be output in order, starting with the document that best fits the search word.
(4) In (2), the score of each document with respect to the search word is obtained by giving a base point to the last word of the search word's word string, raising the importance from the base point while moving back toward the front of the word string, and taking the total of the importance values as the score of the document.
(5) In (2), the degree of coincidence between the search word and a document is calculated using the compound word base, one of the keyword features, so that words unlikely to serve as keywords do not receive a high score when documents are scored.
(6) In (2), the degree of coincidence is calculated using the proper noun constituent word, one of the keyword features, so that words unlikely to serve as keywords do not receive a high score when documents are scored.
(7) In (2), the degree of coincidence is calculated using prefix modification, one of the keyword features, so that special prefixes receive a score.
(8) In (2), the degree of coincidence is calculated using the place-name identifier, one of the keyword features, so that words unlikely to serve as keywords do not receive a high score when documents are scored.
(9) In (2), the degree of coincidence is calculated using the era-name identifier, one of the keyword features, so that words unlikely to serve as keywords do not receive a high score when documents are scored.
(10) Alternatively, the system comprises morphological analysis means for morphologically analyzing an entered search word; importance setting means for setting an importance for each of the words obtained by the morphological analysis means; coincidence degree calculation means for calculating, from the importance values, the degree of coincidence of each keyword, composed of a group of words, assigned to a registered document; document score calculation means for calculating the document score of the document from the degrees of coincidence; and document output means for outputting documents in order of document score. The coincidence degree calculation means takes as the degree of coincidence the product of the importance values of the search-word words that match words contained in the keyword.
(11) In (10), the coincidence degree calculation means gives a larger degree of coincidence when the word sequence in the keyword matches the word sequence in the search word.
(12) In (10), the degree of coincidence obtained by the coincidence degree calculation means when the keyword and the search word match completely does not vary with the number of words in the search word.
(13) In (10), the document score calculation means takes the average of the degrees of coincidence between the keywords of the registered document and the search word as the document score.
(14) In (10), the document score calculation means takes as the document score the sum of the degrees of coincidence between the keywords of the registered document and the search word, divided by the number of keywords whose degree of coincidence is 1 or more.
(15) In (10), the document score calculation means takes the maximum of the degrees of coincidence between the keywords of the registered document and the search word as the document score.
(16) In (10), the document score calculation means changes the method of calculating the document score according to the position at which the keyword appears in the document.
(17) In (16), when the keyword appears in the title of the document, the document score is calculated using the degree of coincidence obtained by the coincidence degree calculation means multiplied by a certain coefficient.
(18) In (16), when the keyword appears in the first sentence of the first paragraph of the document, the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(19) In (16), when the keyword appears in the second or a later sentence of the first paragraph, the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(20) In (16), when the keyword appears in the first sentence of the second or a later paragraph, the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(21) In (16), when the keyword appears in the second or a later sentence of the second or a later paragraph, the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(22) In (10), the document score calculation means changes the method of calculating the document score according to the word that follows the keyword.
(23) In (22), when the word following the keyword is the case particle が (ga), the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(24) In (22), when the word following the keyword is the adverbial particle は (wa), the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(25) In (22), when the word following the keyword is the case particle を (wo), the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(26) In (22), when the word following the keyword is anything other than the case particle が, the adverbial particle は, or the case particle を, the document score is calculated using the degree of coincidence multiplied by a certain coefficient.
(27) In (10), the document score calculation means changes the method of calculating the document score according to both the position at which the keyword appears in the document and the word that follows it.
(28) Alternatively, the system comprises morphological analysis means for morphologically analyzing an entered search word; importance setting means for setting an importance for each of the words obtained by the morphological analysis means; coincidence degree calculation means for calculating the degree of coincidence with the keywords assigned to a registered document, using the importance set by the importance setting means; document score calculation means for calculating the document score of the document from the degrees of coincidence so calculated; and document output means for outputting documents in order of document score. Each document is given a score reflecting the search word by calculating the degree of coincidence between the search word and the keywords in each document, and the documents are output in order of score.
(29) In (28), the importance setting means sets the importance of a word according to the position at which the word appears.
(30) In (29), when setting the importance of a word, the importance setting means sets it according to the number of words making up the search word.
(31) In (29), when setting the importance of a word, the importance setting means sets it according to the part of speech of the word.
(32) In (31), when setting the importance of a word, the importance setting means sets it according to keyword features, which describe grammatical or semantic characteristics of the word that are not captured by its part of speech.
(33) In (28), when calculating the degree of coincidence between a document keyword and the search word, the coincidence degree calculation means takes the total importance of the words common to the keyword and the search word as the degree of coincidence.
(34) In (33), the coincidence degree calculation means increases the degree of coincidence when the word sequence in the keyword matches the word sequence in the search word.
(35) In (33), the coincidence degree calculation means increases the degree of coincidence when the last words of the keyword and of the search word match.
(36) In (33), the coincidence degree calculation means increases the degree of coincidence when the first words of the keyword and of the search word match.

The invention is described below on the basis of embodiments.
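Items (13) to (15) and (16) to (27) describe alternative ways of turning per-keyword degrees of coincidence into a document score. A minimal Python sketch of these variants follows. The function names, the position labels, and every coefficient value are hypothetical: the text only speaks of "a certain coefficient" and gives no numbers, and multiplying the position and particle coefficients together for (27) is an assumption.

```python
from statistics import mean
from typing import Dict, List


def score_mean(degrees: List[float]) -> float:
    """Variant (13): average degree of coincidence over all keywords."""
    return float(mean(degrees)) if degrees else 0.0


def score_mean_of_hits(degrees: List[float]) -> float:
    """Variant (14): sum divided by the number of keywords with degree >= 1."""
    hits = [d for d in degrees if d >= 1]
    return sum(degrees) / len(hits) if hits else 0.0


def score_max(degrees: List[float]) -> float:
    """Variant (15): maximum degree of coincidence."""
    return max(degrees, default=0.0)


# Hypothetical coefficients; the patent text does not state any values.
POSITION_COEFF: Dict[str, float] = {
    "title": 2.0,                  # (17)
    "first_paragraph_first": 1.5,  # (18)
    "first_paragraph_rest": 1.2,   # (19)
    "later_paragraph_first": 1.1,  # (20)
    "later_paragraph_rest": 1.0,   # (21)
}
PARTICLE_COEFF: Dict[str, float] = {"が": 1.3, "は": 1.2, "を": 1.1}  # (23)-(25)
OTHER_PARTICLE_COEFF = 1.0  # (26): any other following word


def weighted_degree(degree: float, position: str, following_word: str) -> float:
    """Variant (27): scale a degree of coincidence by position and following word."""
    particle = PARTICLE_COEFF.get(following_word, OTHER_PARTICLE_COEFF)
    return degree * POSITION_COEFF[position] * particle
```

Keeping the coefficients in lookup tables makes it easy to tune them without touching the scoring functions, which is one plausible reading of "changing the calculation method" by position and following word.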

[0006] FIG. 1 is a block diagram illustrating one embodiment of the document retrieval system according to the present invention. In the figure, 1 is search word input means, 2 is document score assigning means, 3 is document ranking means, 4 is document output means, and 5 denotes documents to which keywords have been assigned. First, the user enters a search word. Next, document score assigning means 2 assigns each document a score according to the entered search word. It is assumed here that documents 5, with keywords assigned and segmented into words in advance, have been prepared. Document ranking means 3 then sorts the scored documents in descending order of score, and document output means 4 outputs them.

[0007] FIG. 2 is a flowchart illustrating the operation of the document score assigning means in FIG. 1.
Step 1: Morphologically analyze the search word and assign a part of speech to each word.
Step 2: Give each of these words an importance according to the rules.
Step 3: Whenever a word in a keyword of a document matches a word of the search word, even partially, credit the importance previously assigned to that word of the search word; total the importance for each keyword to obtain the degree of coincidence of the keyword.
Step 4: Total the degrees of coincidence for each document and take the result as the score of the document.
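Steps 3 and 4 can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: it assumes the importance values of step 2 are already available as a dictionary, that each keyword is a list of words, and that a word counts at most once per keyword. The names `coincidence_degree` and `document_score` and the sample words are hypothetical.

```python
from typing import Dict, List


def coincidence_degree(keyword_words: List[str], importance: Dict[str, int]) -> int:
    """Step 3: total the importance of every search-word word found in the keyword."""
    present = set(keyword_words)
    return sum(value for word, value in importance.items() if word in present)


def document_score(keywords: List[List[str]], importance: Dict[str, int]) -> int:
    """Step 4: total the degrees of coincidence of all keywords of one document."""
    return sum(coincidence_degree(words, importance) for words in keywords)


# Hypothetical importance values for a three-word search word.
importance = {"alpha": 5, "beta": 3, "gamma": 2}
# Hypothetical document with three keywords, already segmented into words.
keywords = [["alpha", "beta"], ["gamma"], ["delta"]]
print(document_score(keywords, importance))
```

A keyword that shares only some words with the search word still contributes, which is how a partial match raises the score of a document.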

[0008] In FIG. 2, "importance" is the value assigned to each individual word obtained by morphologically analyzing the search word. "Degree of coincidence" is the value obtained when a keyword in a document matches the search word or part of it: the importance of each matching word of the search word is credited and totaled for each keyword. "Score" is the final value obtained by totaling the degrees of coincidence for each document.

[0009] FIG. 3 is a flowchart illustrating the rules for assigning importance to a search word. As described above, the search word is assumed to have been morphologically analyzed and segmented by part of speech. The first important point is that the pointer is placed at the last word (step 1); that is, processing proceeds from the end of the word string back toward the front. Initially, n is set to the base point and sum is set to 0 (step 2). Next, it is determined whether the word has been assigned a keyword feature (step 3). Words are thus divided into those with a keyword feature and those without. Words with a feature are handled by the processing above the broken line in FIG. 3 (called phase 1 here), and words without a feature by the processing below the broken line (called phase 2 here). Keyword features are described later.

[0010] Phase 1, the processing of words that have been assigned a keyword feature, is described first. It is first determined whether the keyword feature is "prefix modification" (step 4). As described later, "prefix modification" marks a prefix that serves to modify the word that follows it. If the feature is not "prefix modification", the word is given the value n (step 5). Then n is added to sum, and n is incremented by 1 (step 6). It is then determined whether the word is the first word of the word string (step 7); if not, the pointer moves back one word (step 8), and processing returns to step 3 and repeats. That is, the values of n and sum grow as processing moves toward the front of the word string. If the word is the first word, processing of the words with keyword features ends, and the pointer returns to the last word (step 11) to begin phase 2. A word whose keyword feature was found to be "prefix modification" in step 4 is given the base point (step 9), and the base point is added to sum (step 10).

[0011] Phase 2 follows. After the pointer returns to the last word in step 11, 1 is added to the sum accumulated in phase 1 (step 12). Next, as in phase 1, the word is checked for a keyword feature (step 13). Words with a feature have already been handled in phase 1, so phase 2 deals with the words that have none. For a word with a feature, it is only checked whether the word is the first word of the word string (step 16), in which case processing ends. A word found in step 13 to have no feature is given the value sum (step 14). sum is then added once more to the running total sum, and 1 is added (step 15). It is then determined whether the word is the first word of the word string (step 16); if not, the pointer moves back one word (step 17), and processing returns to step 12 and repeats. That is, in phase 2 sum keeps growing as processing moves toward the front of the word string. As a result, among words with keyword features, those nearer the front of the word string have higher importance, while even a single word without a keyword feature has higher importance than any number of words with keyword features added together.
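The two phases can be condensed into the Python sketch below. It is a simplified reading of FIG. 3, not a transcription of the flowchart: each word is a (surface, feature) pair, the feature labels are hypothetical strings, and a word without a feature receives the running total plus 1. That matches the rule as stated for a single featureless word; how the total is carried forward when several featureless words occur is an assumption here, because the description of steps 12 to 15 leaves the exact bookkeeping open.

```python
from typing import List, Optional, Tuple

PREFIX_MODIFICATION = "prefix_modification"  # hypothetical label for the feature

Word = Tuple[str, Optional[str]]  # (surface form, keyword feature or None)


def assign_importance(words: List[Word], base_point: int = 2) -> List[int]:
    """Return one importance value per word, in the original word order."""
    importance = [0] * len(words)
    n = base_point  # next value for a word with an ordinary keyword feature
    total = 0       # running total of importance assigned so far ("sum" in FIG. 3)

    # Phase 1: walk from the last word to the first, handling words with a feature.
    for i in range(len(words) - 1, -1, -1):
        feature = words[i][1]
        if feature is None:
            continue  # handled in phase 2
        if feature == PREFIX_MODIFICATION:
            importance[i] = base_point  # steps 9-10: base point only
            total += base_point
        else:
            importance[i] = n           # steps 5-6: n, then n grows by one
            total += n
            n += 1

    # Phase 2: walk from the last word again, handling words without a feature.
    for i in range(len(words) - 1, -1, -1):
        if words[i][1] is None:
            importance[i] = total + 1   # outranks everything assigned so far
            total += importance[i]      # assumption: carry the new value forward

    return importance
```

Because a featureless word always receives more than the total assigned before it, a match on that single word outweighs matches on all feature-bearing words combined, which is the property the text emphasizes.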

[0012] The keyword features used in the description above are explained here. There are five kinds of keyword feature: compound word base, proper noun constituent word, prefix modification, place-name identifier, and era-name identifier. Table 1 summarizes the parts of speech to which each feature can be assigned, together with its characteristics and role.

[0013]

[Table 1]

[0014] Apart from "prefix modification", these features mark words that are unlikely to serve as keywords, or that discriminate poorly, when they appear on their own. The word 装置 ("device") alone, for example, is hardly a distinctive word. The same holds for place-name identifiers and era-name identifiers. 東京 ("Tokyo") matches many words, as in 東京大学 (University of Tokyo), 東京〇〇会社 (Tokyo ... Company), 東京〇〇学校 (Tokyo ... School), and 〇〇会社東京支店 (... Company, Tokyo Branch), so "Tokyo" alone matches a large number of words in documents. For this reason, the importance of words with these keyword features is raised by only one point at a time, even when they are located toward the front of the word string. Conversely, common nouns and proper nouns without a keyword feature receive a high importance through sum. "Prefix modification" differs somewhat from the other features. Prefixes are normally hardly regarded as keywords at all, but prefixes such as 新 ("new") and 大 ("large"), which are considered to play a large part in modifying the following word, are marked "prefix modification". These are given only the base point.

[0015] The rules above are now illustrated for the case where the following words are used as search words.

Example 1: 慶応大学医科学研究所 (Keio University Institute of Medical Science)
Perform morphological analysis and segment the search word by part of speech.
(Morphological analysis result) 慶応大学医科学研究所 → 慶応 / 大学 / 医 / 科学 / 研究 / 所 (Keio / University / Medical / Science / Research / Institute)
Assign an importance to each word according to the rules.

[0016]

[Table 2]

[0017] As shown, importance (points) is assigned by first giving the base point (2 points here) to the last word of the word string. For words with a keyword feature, one point is added successively for each preceding word. A word without a keyword feature (here 慶応, "Keio") receives the total of all importance values assigned so far, plus 1. This is because a document containing the keyword "Keio" is regarded as more important than a document that contains, say, the keyword 大学医科学研究所 ("University Institute of Medical Science").

Example 2: 新素材研究開発 (new material research and development)
Perform morphological analysis and segment the search word by part of speech.
(Morphological analysis result) 新素材研究開発 → 新 / 素材 / 研究 / 開発 (new / material / research / development)
Assign an importance to each word according to the rules.

[0018]

[Table 3]

[0019] This example illustrates how prefixes are handled, and how a word without a keyword feature is handled when it is not at the front of the word string. A prefix with the keyword feature "prefix modification" is given the base point (2 points) so that its score differs from that of a prefix without the feature. In Example 1 the word without a keyword feature was at the front of the word string, so importance was simply calculated in order from the last word. In Example 2 the word without a keyword feature (here 素材, "material") is in the middle of the word string, but the flow is the same. Because this word should carry the greatest importance, the importance of "material" is set to the total importance of all the other words plus 1.
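The arithmetic behind the two examples can be written out as follows. Tables 2 and 3 are not reproduced in this text, so this is a reconstruction from the rules as stated, with a base point of 2. It assumes that in Example 1 every word except "Keio" carries a keyword feature other than prefix modification, and that in Example 2 "research" and "development" carry such a feature while "new" carries prefix modification. The notation $w(\cdot)$ for the importance of a word is introduced here for illustration.

```latex
\begin{align*}
\text{Example 1:}\quad
  & w(\text{Institute}) = 2,\; w(\text{Research}) = 3,\; w(\text{Science}) = 4,\;
    w(\text{Medical}) = 5,\; w(\text{University}) = 6 \\
  & w(\text{Keio}) = (2 + 3 + 4 + 5 + 6) + 1 = 21 \\[4pt]
\text{Example 2:}\quad
  & w(\text{development}) = 2,\; w(\text{research}) = 3,\; w(\text{new}) = 2 \\
  & w(\text{material}) = (2 + 3 + 2) + 1 = 8
\end{align*}
```

In both cases the single word without a keyword feature ends up with more importance than all other words of the search word combined.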

[0020] This completes step 2 of FIG. 2: the search word has been assigned importance values. These are next used to score each document. As described for steps 3 and 4 of FIG. 2, when a word of a document keyword matches a word of the search word (even partially), the importance assigned to that word of the search word is credited, the degree of coincidence of each keyword is obtained, and the degrees of coincidence are finally totaled to give the score. The scoring procedure is explained using Example 2 above, that is, with 新素材研究 ("new material research") as the search word. The importance of each word of this search word is shown again below.

[0021]

[Table 4]

【００２２】[0022]

Next, suppose that a certain document has the following keywords. The degree of coincidence of each keyword in the document is then calculated as follows.

【００２３】[0023]

【表５】 [Table 5]

【００２４】[0024]

Once the degrees of coincidence have been calculated, they are summed for each document. This sum is the score of the document. For the document above, 13 + 11 + 10 = 34, so the score is 34 points. When scores have been given to all documents in this way, the document ranking means sorts the documents by score, and the document output means outputs them in descending order of score.
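
The scoring of steps 3 and 4 can be sketched in Python as follows. This is only an illustration under assumptions: Tables 4 and 5 are not reproduced in this text, so the word importances (2, 8, 3) and the three keywords are hypothetical values chosen to reproduce the 13 + 11 + 10 = 34 of the example, and the function names are invented for this sketch.

```python
# Hypothetical importances for the words of the search word (Table 4 is not
# reproduced here; these values are assumed so that the totals match the text).
IMPORTANCE = {"new": 2, "material": 8, "research": 3}


def coincidence_sum(keyword_words, importance):
    """Degree of coincidence: sum of the importances of the search-word
    words that also occur in the keyword (partial matches count)."""
    return sum(importance.get(word, 0) for word in keyword_words)


def document_score(keywords, importance):
    """Document score: sum of the degrees of coincidence of all keywords."""
    return sum(coincidence_sum(kw, importance) for kw in keywords)


# Hypothetical keywords of one document, already split into words.
keywords = [
    ["new", "material", "research"],  # 2 + 8 + 3 = 13
    ["material", "research"],         # 8 + 3 = 11
    ["new", "material"],              # 2 + 8 = 10
]
print(document_score(keywords, IMPORTANCE))  # 34
```

A keyword that shares no word with the search word contributes 0, so it does not change the score under this summing scheme.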

【００２５】[0025]

FIG. 4 is a diagram for explaining another embodiment of the document retrieval system according to the present invention. In the figure, 11 is a search word input means, 12 is a document score giving means, 13 is a document ranking means, and 14 is a document output means. The search word input means 11 accepts the user's search word. The document score giving means 12 gives every registered document a score according to the input search word. Each registered document has keywords attached in advance, already segmented into words. The document ranking means 13 sorts the registered documents in descending order of document score. The document output means 14 outputs the search result to the user.

【００２６】[0026]

FIG. 5 is a block diagram of the document score giving means in FIG. 4. In the figure, 21 is a morphological analysis means, 22 is an importance assigning means, 23 is a coincidence calculation means, 24 is a document score calculation means, and 25 is a registered document. The morphological analysis means 21 performs morphological analysis on the search word and assigns a part of speech to each word. In the importance assigning means 22, the importance is a value representing how important each word obtained by morphological analysis of the search word is; it is calculated for each word according to the rules described later. In the coincidence calculation means 23, the degree of coincidence is a value representing how well each keyword attached to the registered document 25 matches the search word. In the document score calculation means 24, the document score is a value representing how well a registered document matches the search word; it is calculated from the degrees of coincidence between the search word and each keyword attached to the registered document.

【００２７】[0027]

Each means is described concretely below.

Importance assigning means

First, the method of calculating the importance assigned to each word obtained by morphological analysis of the search word is described. The importance is calculated according to the following rules.
The importance of the part-of-speech group 1 word closest to the end of the search word is the basic point.
The importance of any other part-of-speech group 1 word is the importance of the nearest part-of-speech group 1 word behind it plus the increment.
The importance of a prefix with the keyword feature "prefix modification" is the basic point.
The importance of a prefix without the keyword feature "prefix modification" is 0.
The importance of a part-of-speech group 2 word is the sum of (1) the total importance of the part-of-speech group 1 words, (2) the importance of any prefix with "prefix modification", and (3) the total importance of the part-of-speech group 2 words behind it, plus the increment.
The importance of any other word is 0.

【００２８】[0028]

Here, part-of-speech group 1 consists of (1) common nouns with the keyword feature "compound word base", (2) proper nouns with the keyword feature "proper noun constituent word", (3) proper nouns with the keyword feature "place name identifier", and (4) proper nouns with the keyword feature "era name identifier". Part-of-speech group 2 consists of (1) common nouns without a keyword feature, (2) proper nouns without a keyword feature, (3) numerals, (4) suffixes, and (5) unregistered words. The keyword features are summarized in Table 6 below.

【００２９】[0029]

【表６】 [Table 6]

【００３０】[0030]

The processing flow for assigning importance according to these rules is shown in FIG. 3, described above. Examples of importance assignment are shown in Tables 7 and 8 below. Here the basic point is 2 points and the increment is 1 point.

【００３１】[0031]

【表７】 [Table 7]

【００３２】[0032]

【表８】 [Table 8]
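
The importance rules can be sketched in Python as two back-to-front scans. This is one reading of the rules, not a definitive implementation: the category labels ("G1", "G2", "PREFIX_MOD") and the function name are hypothetical, and the categories used for the four-word example are assumed, since the tables themselves are not reproduced in this text.

```python
BASE = 2       # basic point
INCREMENT = 1  # increment

# Category labels are hypothetical names used only in this sketch:
#   "G1"          part-of-speech group 1
#   "G2"          part-of-speech group 2
#   "PREFIX_MOD"  prefix with the keyword feature "prefix modification"
#   anything else (including a prefix without the feature) gets 0


def assign_importance(categories):
    """Return one importance per word, given the category of each word."""
    n = len(categories)
    importance = [0] * n

    # First backward scan: group 1 words and marked prefixes.
    last_g1 = None
    for i in range(n - 1, -1, -1):
        if categories[i] == "G1":
            importance[i] = BASE if last_g1 is None else last_g1 + INCREMENT
            last_g1 = importance[i]
        elif categories[i] == "PREFIX_MOD":
            importance[i] = BASE

    fixed = sum(importance)  # total of group 1 words and marked prefixes

    # Second backward scan: group 2 words.
    g2_behind = 0
    for i in range(n - 1, -1, -1):
        if categories[i] == "G2":
            importance[i] = fixed + g2_behind + INCREMENT
            g2_behind += importance[i]
    return importance


# Assumed categories for new / material / research / development.
print(assign_importance(["PREFIX_MOD", "G2", "G1", "G1"]))  # [2, 8, 3, 2]
```

Under these assumptions the group 2 word "material" receives the sum of the other importances plus 1, which agrees with the handling described for Example 2.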

【００３３】[0033]

Coincidence calculation means

Next, the method of calculating the degree of coincidence between a keyword and the search word from the importances of the words of the search word is described. In the embodiment of FIGS. 1 and 2 described above, the degree of coincidence was the sum of the importances of the search-word words that match words contained in the keyword. In the method of claim 10, by contrast, the degree of coincidence is the product of the importances of the search-word words that match words contained in the keyword. Table 9 below shows an example of the calculation with "new material research and development", from Example 2 of importance assignment, as the search word and with the keyword varied among "new material research", "material research", and "research material".

【００３４】[0034]

【表９】 [Table 9]

【００３５】[0035]

The method of claim 11 is characterized in that the coincidence calculation means yields a larger degree of coincidence when a word sequence in the keyword matches a word sequence in the search word. To this end an "adjacency point" is newly introduced, and the degree of coincidence is multiplied by the adjacency point once for each word sequence in the keyword that matches a word sequence in the search word. Table 10 below again shows an example of the calculation with "new material research and development" as the search word and the keyword varied. Here the adjacency point is 2 points.

【００３６】[0036]

【表１０】 [Table 10]

【００３７】[0037]

When the keyword is "new material research", the keyword matches the search word completely, and among its constituent words both the pair "new", "material" and the pair "material", "research" appear in the same order. The adjacency point (2 points) is therefore multiplied in twice when the degree of coincidence is calculated (underlined in Table 10). Claims 10 and 11 compare as follows. Under claim 10, "material research" and "research material" received the same degree of coincidence, 24. Because the method of claim 11 takes the word order of the keyword and the search word into account, the degree of coincidence of "material research", whose word order partially matches the search word "new material research and development", is doubled to 48.
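
The two calculations can be sketched in Python as follows. The sketch rests on assumptions: the word importances 2, 8, 3, 2 are not taken from a table (Tables 7 to 10 are not reproduced in this text) but were picked because they reproduce the values 24, 48, and 768 quoted in the description, and the function names are hypothetical.

```python
ADJACENCY_POINT = 2

# Assumed importances for the search word
# new / material / research / development.
SEARCH_WORDS = ["new", "material", "research", "development"]
IMPORTANCE = {"new": 2, "material": 8, "research": 3, "development": 2}


def coincidence_product(keyword_words):
    """Claim 10 style: product of the importances of the matching words
    (0 when no word matches)."""
    result = 1
    matched = False
    for word in keyword_words:
        if word in IMPORTANCE:
            result *= IMPORTANCE[word]
            matched = True
    return result if matched else 0


def coincidence_with_adjacency(keyword_words):
    """Claim 11 style: additionally multiply by the adjacency point for each
    pair of neighbouring keyword words that are also neighbours, in the
    same order, in the search word."""
    result = coincidence_product(keyword_words)
    search_pairs = set(zip(SEARCH_WORDS, SEARCH_WORDS[1:]))
    for pair in zip(keyword_words, keyword_words[1:]):
        if pair in search_pairs:
            result *= ADJACENCY_POINT
    return result


print(coincidence_product(["material", "research"]))         # 24
print(coincidence_product(["research", "material"]))         # 24
print(coincidence_with_adjacency(["material", "research"]))  # 48
print(coincidence_with_adjacency(["research", "material"]))  # 24
```

Only the adjacency variant separates the two word orders, which is the point of the comparison between claims 10 and 11.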

【００３８】[0038]

The method of claim 12 is characterized in that, when the keyword and the search word match completely, the degree of coincidence does not vary with the number of words in the search word. To this end a "normalization coefficient" is newly introduced, so that the degree of coincidence equals the normalization coefficient when the keyword and the search word match completely. First, the search word score is calculated from the importances of the constituent words of the search word. The search word score is the degree of coincidence obtained when the keyword is identical to the search word. For example, if the degree of coincidence is calculated by the method of claim 11, the search word score of "new material research and development" is 2 × 8 × 3 × 2 (word importances) × 2 × 2 × 2 (adjacency points) = 768. Normalization divides the degree of coincidence between the keyword and the search word by the search word score and multiplies the result by the normalization coefficient. For example, with a normalization coefficient of 1000 points, the degree of coincidence when the search word and the keyword match is as shown in Table 11 below.

【００３９】[0039]

【表１１】 [Table 11]

【００４０】[0040]

Without normalization, the degree of coincidence for a complete match differs from one search word to another; with normalization it is the same regardless of the search word. Table 12 below shows an example of the calculation with "new material research and development" as the search word and the keyword varied.

【００４１】[0041]

【表１２】 [Table 12]
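
The normalization step can be sketched in Python as follows. The function name is hypothetical, and the second search word score (96) is an invented value used only to show that a complete match always maps to the normalization coefficient; the values of Tables 11 and 12 are not reproduced here.

```python
NORMALIZATION_COEFFICIENT = 1000


def normalize(coincidence, search_word_score,
              coefficient=NORMALIZATION_COEFFICIENT):
    """Scale a degree of coincidence so that a complete match scores
    exactly `coefficient`, whatever the search word."""
    return coincidence * coefficient / search_word_score


# 768 is the search word score given in the text for
# "new material research and development"; 96 is a made-up score for
# some other, shorter search word.
print(normalize(768, 768))  # 1000.0
print(normalize(96, 96))    # 1000.0
print(normalize(384, 768))  # 500.0
```

Because every degree of coincidence is divided by the score of its own search word, results obtained for search words of different lengths become comparable.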

【００４２】[0042]

Document score calculation means

Finally, the method of calculating the document score from the degrees of coincidence between the keywords and the search word is described. In the embodiment shown in FIGS. 1 and 2, the document score was the sum, over all keywords of the registered document, of the degree of coincidence between each keyword and the search word. This had the drawback that the document score grows with the number of keywords attached to the registered document; the methods of claim 13 and claim 14 depend less on the number of keywords. In the method of claim 13, the document score is the average of the degrees of coincidence between the keywords of the registered document and the search word, that is, the sum of the degrees of coincidence divided by the number of keywords of the document. As an example, Table 13 below shows the case where the search word is "new material research and development" and the keywords attached to the document are "new material research", "material research", "research material", and "Ricoh".

【００４３】[0043]

【表１３】 [Table 13]

【００４４】[0044]

Since this document has four keywords, the sum of the degrees of coincidence is divided by 4. In the method of claim 14, the document score is the sum of the degrees of coincidence between the keywords of the registered document and the search word, divided by the number of keywords whose degree of coincidence is 1 or more.

【００４５】[0045]

【表１４】 [Table 14]

【００４６】[0046]

Unlike claim 13, the sum of the degrees of coincidence is divided by 3, because three keywords have a degree of coincidence of 1 or more. In the method of claim 15, the document score is the maximum of the degrees of coincidence between the keywords of the registered document and the search word.

【００４７】[0047]

【表１５】 [Table 15]
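
The three document score calculations of claims 13 to 15 can be compared with a short Python sketch. The function names are hypothetical and the degrees of coincidence are placeholder values, not the figures of Tables 13 to 15 (which are not reproduced here); only the shape of the example follows the text, namely four keywords of which one shares no word with the search word.

```python
def score_average(coincidences):
    """Claim 13 style: mean over all keywords of the document."""
    return sum(coincidences) / len(coincidences)


def score_average_nonzero(coincidences):
    """Claim 14 style: sum divided by the number of keywords whose degree
    of coincidence is 1 or more (0 if there is no such keyword)."""
    hits = [c for c in coincidences if c >= 1]
    return sum(coincidences) / len(hits) if hits else 0


def score_max(coincidences):
    """Claim 15 style: largest degree of coincidence."""
    return max(coincidences)


# Placeholder degrees of coincidence for four keywords; the last one
# stands for a keyword such as "Ricoh" that does not match at all.
coincidences = [240, 60, 30, 0]
print(score_average(coincidences))          # 82.5
print(score_average_nonzero(coincidences))  # 110.0
print(score_max(coincidences))              # 240
```

The sketch assumes a document has at least one keyword; an empty list would need separate handling in `score_average` and `score_max`.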

【００４８】[0048]

Next, an embodiment of claim 16 is described. The importance assignment and the coincidence calculation are the same as in the preceding embodiment, so their description is omitted; the document score calculation is described below. Document score calculation means calculating the document score from the degrees of coincidence between the search word and each keyword attached to a registered document. Several calculation methods were proposed in the preceding embodiment; the average method is used in the description below, although the scheme proposed here can also be applied to the maximum method and others. In the average method, the document score is the average of the degrees of coincidence between the keywords of the registered document and the search word. As an example, the case is shown where the search word is "new material research and development" and the keywords attached to the document are "new material research", "material research", "research material", and "Ricoh".

【００４９】[0049]

【表１６】 [Table 16]

【００５０】[0050]

In the method of claim 16 of the present invention, the calculated document score changes according to where a keyword appears. In general, the importance of a keyword depends on where it appears in the document, so varying the document score with the position of appearance is effective for obtaining search results that meet the user's needs. In configuration (17), when a keyword appears in the title, the degree of coincidence obtained by the coincidence calculation means (the original degree of coincidence) is multiplied by a title coefficient, and the resulting value (the weighted degree of coincidence) is used in the document score calculation. Suppose that, in the preceding example, the keywords appear at the positions shown in the following table. With a title coefficient of 2, the weighted degree of coincidence of "material research", which appears in the title, is 61 × 2 = 122. As a result, the document score also differs from its previous value.

【００５１】[0051]

【表１７】 [Table 17]

【００５２】[0052]

In configurations (18) to (21), a weighted degree of coincidence multiplied by a coefficient is used in the document score calculation when the keyword appears in, respectively, the first sentence of the first paragraph, the second or a later sentence of the first paragraph, the first sentence of the second or a later paragraph, and the second or a later sentence of the second or a later paragraph. Table 18 below shows the calculation of the document score for the preceding example when the coefficient is 1.5 for the first sentence of the first paragraph, 1.2 for the second and later sentences of the first paragraph, 1 for the first sentence of the second and later paragraphs, and 0.8 for the second and later sentences of the second and later paragraphs.

【００５３】[0053]

【表１８】 [Table 18]
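
Position-dependent weighting can be sketched in Python as follows. The coefficients are the ones stated in the text; the position labels, the function names, and the choice to average the weighted values are assumptions made for this sketch, and the input values in the last example are invented.

```python
# Coefficients stated in the text; the position labels are hypothetical.
POSITION_COEFFICIENT = {
    "title": 2.0,
    "p1_s1": 1.5,    # first paragraph, first sentence
    "p1_rest": 1.2,  # first paragraph, second sentence onward
    "pn_s1": 1.0,    # second paragraph onward, first sentence
    "pn_rest": 0.8,  # second paragraph onward, second sentence onward
}


def weighted_coincidence(original, position):
    """Original degree of coincidence times the coefficient of the
    position at which the keyword appears."""
    return original * POSITION_COEFFICIENT[position]


def document_score(entries):
    """Average method applied to weighted degrees of coincidence.
    `entries` is a list of (original degree of coincidence, position)."""
    weighted = [weighted_coincidence(o, p) for o, p in entries]
    return sum(weighted) / len(weighted)


print(weighted_coincidence(61, "title"))  # 122.0, as in the text
print(document_score([(100, "p1_s1"), (50, "pn_s1")]))  # 100.0
```

Keeping the coefficients in a lookup table makes it easy to tune them for a given document collection without touching the scoring code.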

【００５４】[0054]

In the method of claim 17 of the present invention, the calculated document score changes according to the word that follows a keyword. In general, the importance of a keyword depends on the word that follows it, so varying the document score with the following word is effective for obtaining search results that meet the user's needs. In configuration (23), when the word following the keyword is the case particle "ga" (が), the degree of coincidence obtained by the coincidence calculation means (the original degree of coincidence) is multiplied by a coefficient for "ga", and the resulting value (the weighted degree of coincidence) is used in the document score calculation. Suppose that, in the preceding example, the keywords are followed by the words shown in the following table. With a "ga" coefficient of 2, the weighted degree of coincidence of "new material research", which is followed by "ga", is 249 × 2 = 498. As a result, the document score also differs from its previous value.

【００５５】[0055]

【表１９】 [Table 19]

【００５６】[0056]

In configurations (24) to (26), a weighted degree of coincidence multiplied by a coefficient is used in the document score calculation when the word following the keyword is, respectively, the adverbial particle "wa" (は), the case particle "wo" (を), or anything other than the case particle "ga", the adverbial particle "wa", and the case particle "wo" (other). Table 20 below shows the calculation of the document score for the preceding example when the coefficient is 1.5 for "wa", 1 for "wo", and 0.5 for other.

【００５７】[0057]

【表２０】 [Table 20]
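
Weighting by the following word can be sketched in Python as a lookup with a default. The coefficients are the ones stated in the text; the names `PARTICLE_COEFFICIENT`, `OTHER_COEFFICIENT`, and `weighted_by_following_word` are hypothetical.

```python
# Coefficients stated in the text, keyed by the particle that follows
# the keyword in the document.
PARTICLE_COEFFICIENT = {"が": 2.0, "は": 1.5, "を": 1.0}
OTHER_COEFFICIENT = 0.5  # any other following word


def weighted_by_following_word(original, following_word):
    """Original degree of coincidence times the coefficient selected by
    the word that follows the keyword."""
    coefficient = PARTICLE_COEFFICIENT.get(following_word, OTHER_COEFFICIENT)
    return original * coefficient


print(weighted_by_following_word(249, "が"))  # 498.0, as in the text
print(weighted_by_following_word(100, "の"))  # 50.0
```

The weighted values would then be fed into whichever document score calculation is in use, for example the average method.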

【００５８】[0058]

In the method of configuration (27) of the present invention, the calculated document score changes according to both the position at which a keyword appears and the word that follows it. The coefficients introduced in configurations (16) to (26) are applied together to calculate the document score. For the preceding example, the document score is calculated as follows.

【００５９】[0059]

【表２１】 [Table 21]

【００６０】[0060]

The document retrieval methods described above have the following features.
A search succeeds even when the search word input by the user only partially matches a keyword attached to a document.
During a search, the degree of matching between the search word and a keyword (the degree of coincidence) is calculated.
To this end, the search is carried out in the following steps.
S1: The search word is divided into words by morphological analysis.
S2: An importance is set for each word.
S3: The degree of coincidence is calculated from the importances of the words common to the search word and the keyword.
However, these methods leave room for improvement in several respects.
(a) In the importance setting of S2, the search word has to be scanned twice from back to front, so the importance setting is complicated.
(b) In the coincidence calculation of S3 as described in paragraphs [0022] to [0024], the order of the words in the keyword and the search word is ignored, so keywords that differ only in word order receive the same degree of coincidence. For example, keywords made up of the same constituent words in a different order, such as "material research" and "research material", cannot be distinguished.
(c) As shown in paragraph [0035], introducing the adjacency point makes it possible to distinguish keywords with different word orders, but the degree of coincidence is then calculated by multiplication. Multiplication is generally slower than addition on a computer, so this method slows down document retrieval.

【００６１】[0061]

In the embodiment described below, point (a) is addressed by scanning the search word only once. Point (b) is addressed by awarding, in the coincidence calculation, a bonus score according to how far the word order matches. Point (c) is addressed by not using multiplication in the coincidence calculation. FIG. 6 is a block diagram for explaining still another embodiment of the document retrieval system according to the present invention. In the figure, 31 is a document retrieval means, 32 is a search word input means, 33 is a document score giving means, 34 is a document sorting means, 35 is a document output means, 36 is an index word file, 37 is a document file, and 38 is a document registration means.

【００６２】[0062]

The document registration means 38 stores a document input by the user, together with the keywords attached to it, in the document file and the index word file. More than one keyword can be set for a registered document, and a keyword may be a compound word made up of several constituent words (for example, "document retrieval" is a compound of the two words "document" and "retrieval"). The index word file 36 is structured so that the keyword or keywords of each registered document can be identified. The document retrieval means 31 uses the index word file 36 to find documents that match the search word input by the user and presents the result to the user. It consists of four means: the search word input means 32, the document score giving means 33, the document sorting means 34, and the document output means 35. The search word input means 32 accepts the user's search word. The document score giving means 33 calculates, for every registered document, a score according to the input search word. The document sorting means 34 sorts the registered documents in descending order of document score. The document output means 35 outputs the search result to the user.

【００６３】[0063]

FIG. 7 is a block diagram of the document score giving means in FIG. 6. In the figure, 41 is a morphological analysis means, 42 is an importance setting means, 43 is a document score calculation means, and 44 is a coincidence calculation means. The morphological analysis means 41 performs morphological analysis on the search word, dividing it into words and determining the part of speech of each word. In the document retrieval apparatus of the present invention, a compound word made up of several words can be used as the search word input by the user. In the importance setting means 42, the importance is a value representing how important each word obtained by morphological analysis of the search word is; the setting method is described in detail later. In the document score calculation means 43, the document score is a value representing how well a registered document matches the search word; it is calculated from the degrees of coincidence between the search word and each keyword attached to the registered document. Here, the degree of coincidence is a value representing how well each keyword attached to the registered document matches the search word; it is calculated from the importances of the words of the search word, as described in detail later. The document score is calculated by the methods described above (paragraphs [0042] to [0055]).

【００６４】[0064]

The importance setting means and the coincidence calculation means are described below, starting with the importance setting means. When importances are set, the search word input by the user has already been divided into words by morphological analysis. A search word Q made up of n (n > 0) words is written q_1 … q_n. For example, the search word "document retrieval apparatus" is made up of the three words "document", "retrieval", and "apparatus", so q_1 = document, q_2 = retrieval, and q_3 = apparatus. The importance of a word q in the search word is written w(q). In the present invention, the importance of a word is given as follows.
・The importance of the last word of the search word is the basic point α.
・The importance of any other word is the basic point plus the word's distance from the last word multiplied by the position coefficient β.

【００６５】[0065]

Written as a formula:

w(q_i) = α + β × (n − i)   … (1)

With this method, the search word need not be scanned twice as in the methods described earlier; importances can be set for all constituent words of the search word in a single scan. As an example, take the search word "new material fiber development". It is divided into the four words "new", "material", "fiber", and "development". With the parameters α = 10 and β = 2, the importances of the words are as shown in Table 22 below.

【００６６】[0066]

【表２２】 [Table 22]
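
Formula (1) can be evaluated with a few lines of Python. The function name `importance` is hypothetical; the printed values follow directly from the formula with α = 10 and β = 2 for the four-word example.

```python
def importance(i, n, alpha, beta):
    """Formula (1): importance of the i-th word (1-based) of an n-word
    search word."""
    return alpha + beta * (n - i)


words = ["new", "material", "fiber", "development"]
n = len(words)
weights = [importance(i, n, alpha=10, beta=2) for i in range(1, n + 1)]
print(dict(zip(words, weights)))
# {'new': 16, 'material': 14, 'fiber': 12, 'development': 10}
```

Each importance depends only on the word's index and the length of the search word, which is why a single pass over the words suffices.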

【００６７】[0067]

With the method above, the importance of words near the head keeps rising as the number of constituent words of the search word grows. Even when two different search words share the same head word, that word therefore receives a higher importance in the search word with more constituent words. The method of claim 20 avoids this problem by applying a bias that depends on the number of constituent words. That is, a constituent-word-count coefficient γ is introduced and the importance is set as follows:

w(q_i) = α + β × (n − i) + γ × n   … (2)

In particular, with γ = −β the importance of the head word always has the same value, independent of the number of constituent words. For the search word "new material fiber development" used in the previous example, with the parameters α = 12, β = 2, and γ = −2, the importances of the words are as shown in Table 23 below.

【００６８】[0068]

【表２３】 [Table 23]
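
Formula (2) and the effect of choosing γ = −β can be checked with a short Python sketch. The function name is hypothetical; the values follow from the formula with α = 12, β = 2, γ = −2.

```python
def importance(i, n, alpha, beta, gamma):
    """Formula (2): formula (1) plus a bias proportional to the number of
    constituent words n."""
    return alpha + beta * (n - i) + gamma * n


# Four-word search word: new / material / fiber / development.
print([importance(i, 4, 12, 2, -2) for i in range(1, 5)])  # [10, 8, 6, 4]

# With gamma = -beta the head word (i = 1) scores alpha - beta = 10
# whatever the length of the search word.
print([importance(1, n, 12, 2, -2) for n in range(1, 7)])
# [10, 10, 10, 10, 10, 10]
```

Substituting i = 1 and γ = −β into formula (2) gives α + β(n − 1) − βn = α − β, which contains no n.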

【００６９】[0069]

With the methods above, importance is set by the same formula regardless of the nature of the word. Depending on their nature, however, some words are important as search terms and others are not, and it is desirable to give the important ones a high importance. Prefixes, for example, play an auxiliary role and are generally less important than nouns. The method of claim 21 therefore allows the importance parameters (α, β, γ) to vary with the part of speech of the word. For example, let the parameters for nouns (common nouns, sahen nouns, and so on) be α[noun] = 12, β[noun] = 2, γ[noun] = −2, and the parameters for prefixes be α[prefix] = 4, β[prefix] = 0, γ[prefix] = 0. The importances of the words of the search word "new material fiber development" are then as shown in Table 24 below.

[0070]

[Table 24]
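The part-of-speech-dependent setting can be sketched as a parameter lookup in front of equation (2). The sketch below is illustrative only: the part-of-speech tags assigned to each word (「新」 as a prefix, the other three as nouns) and the 1-based position index are assumptions, and the names `PARAMS` and `importance_by_pos` are hypothetical.

```python
# (alpha, beta, gamma) per part of speech, using the values given in the text.
PARAMS = {
    "noun": (12, 2, -2),
    "prefix": (4, 0, 0),
}


def importance_by_pos(tagged_words):
    """Apply w(q_i) = alpha + beta*(n - i) + gamma*n with per-POS parameters.

    tagged_words: list of (word, part_of_speech) pairs in search-word order.
    Assumption: positions i are counted from 1.
    """
    n = len(tagged_words)
    result = {}
    for i, (word, pos) in enumerate(tagged_words, start=1):
        alpha, beta, gamma = PARAMS[pos]
        result[word] = alpha + beta * (n - i) + gamma * n
    return result


# Assumed morphological analysis of "新素材繊維開発".
query = [("新", "prefix"), ("素材", "noun"), ("繊維", "noun"), ("開発", "noun")]
pos_weights = importance_by_pos(query)
```

Because β and γ are zero for prefixes, a prefix always receives its base value α regardless of its position or the length of the search word.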

[0071] In the method described above, words with the same part of speech have their importance set by the same formula. However, whether a word is important as a search term cannot be determined from the part of speech alone; it also depends on factors such as the nature of the documents the retrieval system covers. The embodiments described earlier propose keyword features as a way of describing grammatical and semantic characteristics of a word at a finer granularity than the part of speech. For example, in a retrieval system for fiber-related documents, nouns relating to fibers appear frequently in the documents and are therefore less important as search words than ordinary nouns. Accordingly, the noun 「繊維」 (fiber) is given the keyword feature "compound word base" to distinguish it from other ordinary nouns. The method of claim 22 thus allows the importance-setting parameters (α, β, γ) to be varied not only by the part of speech of the word but also by its keyword features. For example, let the parameters for nouns depend on the presence or absence of the keyword feature "compound word base" as follows:

α[noun, with feature] = 12, β[noun, with feature] = 2, γ[noun, with feature] = −2,
α[noun, without feature] = 1, β[noun, without feature] = 1, γ[noun, without feature] = −1.

With the prefix parameters the same as before, the importance of each word of the search word 「新素材繊維開発」 is as shown in Table 25 below.

[0072]

[Table 25]

[0073] Next, the method of calculating the degree of coincidence is described. This calculation determines how well one of the keywords assigned to a document matches the search word, using the importance values set for the constituent words of the search word. Basically, the degree of coincidence between a keyword and a search word is defined as the sum of the importance values set for the constituent words they have in common. For example, suppose that 「新素材繊維開発」 is the search word and that importance has been set as in Table 25. The degree of coincidence is calculated below for three keywords: 「新素材」 (new material), 「繊維素材開発」 (fiber material development), and 「合成繊維販売」 (synthetic fiber sales).

[0074]
1. Keyword: 「新素材」 (constituent words: 「新」, 「素材」)
   The two words 「新」 and 「素材」 are shared with the search word.
   Degree of coincidence = w(新) + w(素材) = 4 + 8 = 12
2. Keyword: 「繊維素材開発」 (constituent words: 「繊維」, 「素材」, 「開発」)
   The three words 「繊維」, 「素材」, and 「開発」 are shared with the search word.
   Degree of coincidence = w(繊維) + w(素材) + w(開発) = 3 + 8 + 4 = 15
3. Keyword: 「合成繊維販売」 (constituent words: 「合成」, 「繊維」, 「販売」)
   Only 「繊維」 is shared with the search word.
   Degree of coincidence = w(繊維) = 3
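The basic calculation above amounts to summing the importance of the shared words. The following is a minimal sketch under the assumption that keywords are already segmented into constituent words; the names `WEIGHTS` and `coincidence` are hypothetical, and the importance values are the ones used in the three worked examples.

```python
# Importance of each constituent word of the search word "新素材繊維開発",
# taken from the worked examples in the text.
WEIGHTS = {"新": 4, "素材": 8, "繊維": 3, "開発": 4}


def coincidence(keyword_words, weights):
    """Sum the importance of every distinct word shared with the search word."""
    return sum(weights[w] for w in set(keyword_words) if w in weights)


score_1 = coincidence(["新", "素材"], WEIGHTS)          # 4 + 8
score_2 = coincidence(["繊維", "素材", "開発"], WEIGHTS)  # 3 + 8 + 4
score_3 = coincidence(["合成", "繊維", "販売"], WEIGHTS)  # 3
```

Words that occur only in the keyword, such as 「合成」 and 「販売」, contribute nothing, so a keyword is never penalized for containing extra words in this basic form.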

[0075] With the method described above, when several words are common to the search word and a keyword, no distinction is made according to the order in which those common words appear. That is, for the search word 「新素材繊維開発」, the keywords 「素材繊維」 (material fiber) and 「繊維素材」 (fiber material) receive the same degree of coincidence. However, the order of 「素材」 and 「繊維」 in the search word matches their order in 「素材繊維」, so 「素材繊維」 should receive a higher degree of coincidence than 「繊維素材」. Therefore, when the search word and a keyword have several words in common, bonus points are added if the order of those words (the word sequence) is the same in both. The bonus (hereinafter "adjacency points") is proportional to the number of matching word sequences, with δ adjacency points awarded per matching word sequence. With δ = 3, the degrees of coincidence for the same search word and keywords as before are as follows.

[0076]
1. Keyword: 「新素材」
   The word sequence 「新」「素材」 is shared.
   Degree of coincidence = w(新) + w(素材) + δ = 4 + 8 + 3 = 15
2. Keyword: 「繊維素材開発」
   Three words are shared, but no word sequence matches.
   Degree of coincidence = w(繊維) + w(素材) + w(開発) = 3 + 8 + 4 = 15

The method described above cannot distinguish the case where the search word and the keyword match completely from the case where the search word is merely contained in the keyword. That is, for the search word 「新素材繊維開発」, the degree of coincidence is the same whether the keyword is 「新素材繊維開発」 or 「新素材繊維開発センター」 (new material fiber development center). To solve this problem, the claimed methods add δ[head] as a bonus when the first word of the search word matches the first word of the keyword, and add δ[tail] as a bonus when the last word of the search word matches the last word of the keyword. With δ[head] = δ[tail] = 2, the degrees of coincidence for the same search word and keywords as before are as follows.

[0077]
1. Keyword: 「新素材」
   The word sequence 「新」「素材」 is shared, and 「新」 is the first word of both the search word and the keyword.
   Degree of coincidence = w(新) + w(素材) + δ + δ[head] = 4 + 8 + 3 + 2 = 17
2. Keyword: 「繊維素材開発」
   「開発」 is the last word of both the search word and the keyword.
   Degree of coincidence = w(繊維) + w(素材) + w(開発) + δ[tail] = 3 + 8 + 4 + 2 = 17
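The adjacency, head, and tail bonuses can be layered on the basic sum as sketched below. This is one possible reading, not the patented implementation: a "matching word sequence" is assumed to mean a pair of neighboring words that appears in the same order in both the search word and the keyword, and the names `adjacent_pairs` and `coincidence_with_bonus` are hypothetical.

```python
QUERY = ["新", "素材", "繊維", "開発"]
WEIGHTS = {"新": 4, "素材": 8, "繊維": 3, "開発": 4}


def adjacent_pairs(words):
    """Set of ordered neighboring word pairs, e.g. {("新", "素材"), ...}."""
    return set(zip(words, words[1:]))


def coincidence_with_bonus(query, keyword, weights,
                           delta=0, delta_head=0, delta_tail=0):
    """Sum of shared-word importance plus optional order-based bonuses."""
    score = sum(weights[w] for w in set(query) & set(keyword))
    # Adjacency points: delta per neighboring pair shared in the same order.
    score += delta * len(adjacent_pairs(query) & adjacent_pairs(keyword))
    if query[0] == keyword[0]:      # first words match
        score += delta_head
    if query[-1] == keyword[-1]:    # last words match
        score += delta_tail
    return score


# Adjacency bonus only (delta = 3).
s_new_material = coincidence_with_bonus(QUERY, ["新", "素材"], WEIGHTS, delta=3)
s_material_fiber = coincidence_with_bonus(QUERY, ["素材", "繊維"], WEIGHTS, delta=3)
s_fiber_material = coincidence_with_bonus(QUERY, ["繊維", "素材"], WEIGHTS, delta=3)

# Adjacency plus head/tail bonuses (delta[head] = delta[tail] = 2).
s_new_material_head = coincidence_with_bonus(
    QUERY, ["新", "素材"], WEIGHTS, delta=3, delta_head=2, delta_tail=2)
s_exact = coincidence_with_bonus(
    QUERY, QUERY, WEIGHTS, delta=3, delta_head=2, delta_tail=2)
s_center = coincidence_with_bonus(
    QUERY, QUERY + ["センター"], WEIGHTS, delta=3, delta_head=2, delta_tail=2)
```

With these settings 「新素材」 scores 15 with the adjacency bonus alone and 17 once the head bonus is added, matching the worked examples. 「素材繊維」 now scores higher than 「繊維素材」, and an exact match scores higher than a keyword that merely contains the search word, because only the exact match earns the tail bonus.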

[0078]

[Effects] As is apparent from the above description, the present invention has the following effects.

(1) Effect corresponding to claim 1: The search word is morphologically analyzed, and the resulting part-of-speech-decomposed words are compared with keywords stored in part-of-speech units in the document, so a search can be performed even when the search word and the words in the document do not match exactly.

(2) Effect corresponding to claim 2: By calculating the degree of coincidence between the search word and the keywords in each document, each document can be given a score that reflects the search word.

(3) Effect corresponding to claim 3: Because documents are scored according to the search word, they can be output in order starting with those that best match the search word.

(4) Effect corresponding to claim 4: The score is obtained by giving a basic point to the last word of the word string, raising the points toward the front of the word string, and taking the total as the score of the document, so words nearer the front of the word string receive higher points.

(5) Effect corresponding to claim 5: By using the compound word base, one of the keyword features, in calculating the degree of coincidence between the search word and the document, words that are unlikely to serve as keywords can be kept from receiving high scores when documents are scored.

(6) Effect corresponding to claim 6: By using the proper noun constituent word, one of the keyword features, in calculating the degree of coincidence between the search word and the document, words that are unlikely to serve as keywords can be kept from receiving high scores when documents are scored.

(7) Effect corresponding to claim 7: By using prefix modification, one of the keyword features, in calculating the degree of coincidence between the search word and the document, prefixes with a special meaning can be given a score.

(8) Effect corresponding to claim 8: By using the place name identification word, one of the keyword features, in calculating the degree of coincidence between the search word and the document, words that are unlikely to serve as keywords can be kept from receiving high scores when documents are scored.

(9) Effect corresponding to claim 9: By using the era name identification word, one of the keyword features, in calculating the degree of coincidence between the search word and the document, words that are unlikely to serve as keywords can be kept from receiving high scores when documents are scored.

(10) Effect corresponding to claim 10: Because the coincidence degree calculating means takes as the degree of coincidence the product of the importance values of the search-word words that match words included in the keyword, the degree of coincidence can be calculated appropriately.

(11) Effect corresponding to claim 11: By taking word order into account, the degree of coincidence can be calculated accurately.

(12) Effect corresponding to claim 12: By introducing normalization according to the importance given to the search word, the degree of coincidence can be calculated accurately without depending on the length of the search word.

(13) Effect corresponding to claim 13: By taking as the document score the average of the degrees of coincidence between the keywords of the registered document and the search word, the document score can be calculated accurately without depending on the number of keywords in the document.

(14) Effect corresponding to claim 14: By taking as the document score the sum of the degrees of coincidence between the keywords of the registered document and the search word, divided by the number of keywords whose degree of coincidence is 1 or more, the document score can be calculated accurately without depending on the number of keywords in the document.

(15) Effect corresponding to claim 15: By taking as the document score the maximum of the degrees of coincidence between the keywords of the registered document and the search word, the document score can be calculated accurately without depending on the number of keywords in the document.

(16) Effect corresponding to claim 16: Because the weighted degree of coincidence and the document score are calculated according to where the keyword appears in the registered document, the document score is more appropriate than with conventional methods.

(17) Effect corresponding to claim 17: Because the weighted degree of coincidence and the document score are calculated according to the word that follows the keyword in the registered document, the document score is more appropriate than with conventional methods.

(18) Effect corresponding to claims 18 to 22: Because the importance setting means sets the importance of each word according to its position among the constituent words of the search word, importance can be set appropriately and retrieval accuracy improves. In addition, the search word needs to be scanned only once, so retrieval speed improves.

(19) Effect corresponding to claims 23 to 26: Because the coincidence degree calculating means reflects the order of the constituent words (the word sequence) of the search word and the keyword in the degree of coincidence, the degree of coincidence can be calculated appropriately and retrieval accuracy improves. In addition, the calculation consists only of additions, so retrieval speed improves.
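The three document-score variants named in effects (13) to (15) differ only in how the per-keyword degrees of coincidence are aggregated. The sketch below illustrates that difference; the function names and the sample values are hypothetical, and returning 0 for a document with no keywords (or no matching keywords) is an assumption the source does not address.

```python
def score_average(coincidences):
    """Claim 13 style: average over all keywords of the document."""
    return sum(coincidences) / len(coincidences) if coincidences else 0.0


def score_average_of_matches(coincidences):
    """Claim 14 style: sum divided by the number of keywords scoring 1 or more."""
    matched = [c for c in coincidences if c >= 1]
    return sum(coincidences) / len(matched) if matched else 0.0


def score_maximum(coincidences):
    """Claim 15 style: best single keyword."""
    return max(coincidences, default=0)


# Hypothetical document with four keywords, one of which does not match at all.
doc_coincidences = [12, 15, 3, 0]

avg = score_average(doc_coincidences)
avg_matched = score_average_of_matches(doc_coincidences)
best = score_maximum(doc_coincidences)
```

For this sample the plain average is pulled down by the non-matching keyword, while the other two variants ignore it, which is the sense in which they avoid dependence on the number of keywords in the document.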

[Brief description of drawings]

FIG. 1 is a configuration diagram for explaining an embodiment of the document retrieval system according to the present invention.

FIG. 2 is a flowchart for explaining the operation of the document score assigning means in FIG. 1.

FIG. 3 is a flowchart for explaining the rules for assigning importance to a search word according to the present invention.

FIG. 4 is a configuration diagram for explaining another embodiment of the document retrieval system according to the present invention.

FIG. 5 is a configuration diagram of the document score assigning means in FIG. 4.

FIG. 6 is a configuration diagram for explaining still another embodiment of the document retrieval system according to the present invention.

FIG. 7 is a configuration diagram of the document score assigning means in FIG. 6.

[Explanation of symbols]

1 ... search word input means, 2 ... document score assigning means, 3 ... document ranking means, 4 ... document output means, 5 ... documents with assigned keywords.

Claims

[Claims]

1. A document retrieval system comprising: morphological analysis means for morphologically analyzing an input search word; and comparison means for comparing the part-of-speech-decomposed words obtained by the morphological analysis means with keywords stored in word units in a document, whereby a search can be performed even if the search word and the word in the document do not exactly match.

2. The document retrieval system according to claim 1, wherein a score corresponding to the search word is given to each document by calculating the degree of coincidence between the search word and the keywords in each document.

3. The document retrieval system according to claim 2, wherein, because the documents are scored according to the search word, the documents are output in order starting with those that best match the search word.

4. The document retrieval system according to claim 2, wherein the score corresponding to the search word in each document is obtained by giving a basic point to the last word of the word string of the search word, increasing the importance from the basic point going back toward the front of the word string, and using the total of the importance values as the score of the document.

5. The document retrieval system according to claim 2, wherein, in calculating the degree of coincidence between the search word and the document, the compound word base, which is one of the keyword features, is used so that a word that is unlikely to be a keyword is not given a high score when the document is scored.

6. The document retrieval system according to claim 2, wherein, in calculating the degree of coincidence between the search word and the document, the proper noun constituent word, which is one of the keyword features, is used so that a word that is unlikely to be a keyword is not given a high score when the document is scored.

7. The document retrieval system according to claim 2, wherein, in calculating the degree of coincidence between the search word and the document, prefix modification, which is one of the keyword features, is used so that a score is given to a special prefix.

8. The document retrieval system according to claim 2, wherein, in calculating the degree of coincidence between the search word and the document, the place name identification word, which is one of the keyword features, is used so that a word that is unlikely to be a keyword is not given a high score when the document is scored.

9. The document retrieval system according to claim 2, wherein, in calculating the degree of coincidence between the search word and the document, the era name identification word, which is one of the keyword features, is used so that a word that is unlikely to be a keyword is not given a high score when the document is scored.

10. A document retrieval system comprising: morphological analysis means for morphologically analyzing an input search word; importance setting means for setting an importance for each of the words obtained by the morphological analysis means; coincidence degree calculating means for calculating, from the importance, the degree of coincidence with a keyword that is assigned to a registered document and consists of a group of words; document score calculating means for calculating the document score of the document from the degree of coincidence; and document output means for outputting the documents in order of the document scores calculated by the document score calculating means, wherein the coincidence degree calculating means takes as the degree of coincidence the product of the importance values of the words of the search word that match words included in the keyword.

11. The document retrieval system according to claim 10, wherein the coincidence degree calculating means increases the degree of coincidence when the word sequence included in the keyword matches the word sequence included in the search word.

12. The document retrieval system according to claim 10, wherein the degree of coincidence obtained by the coincidence degree calculating means when the keyword and the search word match completely does not change with the number of words included in the search word.

13. The document retrieval system according to claim 10, wherein the document score calculating means uses as the document score the average of the degrees of coincidence between the keywords of the registered document and the search word.

14. The document retrieval system according to claim 10, wherein the document score calculating means uses as the document score the sum of the degrees of coincidence between the keywords of the registered document and the search word, divided by the number of keywords whose degree of coincidence is 1 or more.

15. The document retrieval system according to claim 10, wherein the document score calculating means uses as the document score the maximum of the degrees of coincidence between the keywords of the registered document and the search word.

16. The document retrieval system according to claim 10, wherein the document score calculating means changes the method of calculating the document score according to the position at which the keyword appears in the document.

17. The document retrieval system according to claim 10, wherein the document score calculating means changes the method of calculating the document score according to the word that follows the keyword.

18. A document retrieval system comprising: morphological analysis means for morphologically analyzing an input search word; importance setting means for setting an importance for each of the words obtained by the morphological analysis means; coincidence degree calculating means for calculating the degree of coincidence with a keyword assigned to a registered document, using the importance set by the importance setting means; document score calculating means for calculating the document score of the document from the degree of coincidence calculated by the coincidence degree calculating means; and document output means for outputting the documents in order of the document scores calculated by the document score calculating means, wherein the degree of coincidence between the search word and the keywords in each document is calculated so that each document is given a score corresponding to the search word, and the documents are output in order of score.

19. The document retrieval system according to claim 18, wherein the importance setting means sets the importance of a word according to the position at which the word appears.

20. The document retrieval system according to claim 19, wherein, when the importance setting means sets the importance of a word, the importance is set according to the number of constituent words of the search word.

21. The document retrieval system according to claim 19, wherein, when the importance setting means sets the importance of a word, the importance is set according to the part of speech of the word.

22. The document retrieval system according to claim 21, wherein, when the importance setting means sets the importance of a word, the importance is set according to a keyword feature that describes grammatical or semantic characteristics not described by the part of speech of the word.

23. The document retrieval system according to claim 18, wherein, when the coincidence degree calculating means calculates the degree of coincidence between a keyword of a document and the search word, the sum of the importance values of the words common to the keyword and the search word is taken as the degree of coincidence.

24. The document retrieval system according to claim 23, wherein, when the coincidence degree calculating means calculates the degree of coincidence between a keyword of a document and the search word, the degree of coincidence is increased when the word sequence included in the keyword matches the word sequence included in the search word.

25. The document retrieval system according to claim 23, wherein, when the coincidence degree calculating means calculates the degree of coincidence between a keyword of a document and the search word, the degree of coincidence is increased when the last word of the keyword matches the last word of the search word.

26. The document retrieval system according to claim 23, wherein, when the coincidence degree calculating means calculates the degree of coincidence between a keyword of a document and the search word, the degree of coincidence is increased when the first word of the keyword matches the first word of the search word.