JP5488424B2 - Information processing apparatus, control method therefor, and program

Info

Publication number: JP5488424B2
Application number: JP2010261663A
Authority: JP
Inventors: 靖大田中
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2010-11-24
Filing date: 2010-11-24
Publication date: 2014-05-14
Anticipated expiration: 2030-11-24
Also published as: JP2012113501A

Description

The present invention relates to an information processing apparatus, a control method therefor, and a program, and more particularly to a technique for detecting, from among registered electronic documents, an electronic document that contains a character string approximating a character string in an input electronic document.

コンピュータおよびインターネットの普及により、電子文書を取り扱う機会は日々増加している。しかし電子文書は複製が容易なため、電子的な情報漏洩や著作権侵害など様々な問題が多く発生している。 With the spread of computers and the Internet, opportunities to handle electronic documents are increasing day by day. However, since electronic documents can be easily copied, various problems such as electronic information leakage and copyright infringement have occurred.

情報漏洩に関しては、近年ＤＬＰ（ＤａｔａＬｅａｋＰｒｅｖｅｎｔｉｏｎ）といった製品が注目されている。ＤＬＰはネットワーク上やクライアントＰＣ上に設置される情報漏洩対策の製品であり、予め設定された機密文書を特徴付ける条件に基づき、機密文書の外部への送信を制御することができる。 With regard to information leakage, products such as DLP (Data Leak Prevention) have recently attracted attention. DLP is an information leakage countermeasure product installed on a network or on a client PC, and can control transmission of confidential documents to the outside based on conditions that characterize the confidential documents set in advance.

Previously, conditions had to be set using senders, recipients, keywords in the body text, and the like, which required cumbersome work. Recently, however, products have appeared that register the confidential documents themselves and control the transmission of a registered confidential document or a document derived from it. In addition, in report writing at universities and elsewhere, creating a report by copying parts of documents on the Internet and combining them has become a problem, and products that detect such reports have also appeared.

Realizing such products requires a technique that finds, for a specific document, not only documents with exactly the same content but also documents that approximate a query document derived from it by partial copying, small-scale editing, or the like.

Conventional techniques for searching for documents similar to a specified document include similar-sentence search, which calculates a similarity from the degree to which the components of the documents match, and DP matching, which can judge how closely two documents approximate each other by directly comparing the character strings that constitute them.

Patent Documents 1 and 2 disclose methods that use an inverted index of the appearance positions of concatenated characters to perform a search according to the degree of matching with a search term.

Patent Document 3 discloses a method in which the components of a document are converted into hash values and represented as a condensed character string (fingerprint), and documents with a high degree of matching with a query document are retrieved by comparing fingerprints.

Patent Document 1: Japanese Patent Laid-Open No. 08-235212
Patent Document 2: Japanese Patent Laid-Open No. 09-259140
Patent Document 3: Japanese National Publication (Translation of PCT Application) No. 2008-541272

However, conventional similar-sentence search basically calculates similarity from the degree to which the components of documents match and does not consider the order or positional relationship of the components. As a result, it detects even documents whose wording differs greatly, as long as they contain similar components.

またＤＰマッチングなどを用いると、２つの文書を構成する文字列の直接比較により文書の近似度合を求めることができるが、検出の対象となる文書が多くなると、文書数に比例して比較回数が増え、処理時間が増大するという問題がある。 When DP matching or the like is used, the degree of approximation of documents can be obtained by direct comparison of character strings constituting two documents. However, when the number of documents to be detected increases, the number of comparisons increases in proportion to the number of documents. There is a problem that the processing time increases.

Furthermore, the methods disclosed in Patent Documents 1 and 2 are intended for short search terms and evaluate every concatenated character string that constitutes the search term. As the search term becomes longer, the amount of processing increases and the search takes longer.

特許文献３においては、フィンガープリントの生成の際、特徴的な構成要素のみを対象とし、比較箇所を削減することで高速化を図っているが、比較箇所の選択基準がそれぞれの文書ごとの統計値に基づいており、登録文書と問合せ文書（検索語となる文字列を含む文書）で基準が異なっているため、異なる構成要素について比較してしまう可能性がある。特に検索対象となる文書の一部しか含まない文書を問合せ文書とする場合については問題が顕著となる。 In Patent Document 3, only characteristic components are targeted when fingerprints are generated, and the speed is increased by reducing the number of comparison points. However, the selection criteria for comparison points are statistics for each document. Since the standard is different between the registered document and the inquiry document (a document including a character string serving as a search word) based on the value, there is a possibility that different components are compared. In particular, the problem becomes significant when a document including only a part of a document to be searched is used as an inquiry document.

The present invention has been made to solve the above problems. Its object is to provide a mechanism that accurately determines a registered document approximating a query document by calculating the degree of approximation between the registered document and the query document according to the number of partial character strings in the registered document that have the same positional relationship as the partial character strings constituting the query document.

The present invention is an information processing apparatus that includes storage means for storing partial character strings, obtained by decomposing text contained in a registered document, and the position of each partial character string in the registered document, the registered document being a document to be compared for approximation with a query document that is a designated document, and that determines a registered document approximating the designated query document. The apparatus comprises: partial character string acquisition means for acquiring partial character strings that are contained in a registered document stored in the storage means and are the same as partial character strings obtained by decomposing the query document; calculation means for calculating the degree of approximation between the registered document and the query document using the number of partial character strings for which the partial character string in the query document and the partial character string in the registered document have the same positional relationship, as determined from the positions of the partial character strings in the query document and the positions of the partial character strings in the registered document acquired by the partial character string acquisition means; and determination means for determining a registered document that approximates the query document according to the result calculated by the calculation means.

The present invention is also a control method for an information processing apparatus that includes storage means for storing partial character strings, obtained by decomposing text contained in a registered document, and the position of each partial character string in the registered document, the registered document being a document to be compared for approximation with a query document that is a designated document, and that determines a registered document approximating the designated query document. The method comprises: a partial character string acquisition step in which partial character string acquisition means of the information processing apparatus acquires partial character strings that are contained in a registered document stored in the storage means and are the same as partial character strings obtained by decomposing the query document; a calculation step in which calculation means of the information processing apparatus calculates the degree of approximation between the registered document and the query document using the number of partial character strings for which the partial character string in the query document and the partial character string in the registered document have the same positional relationship, as determined from the positions of the partial character strings in the query document and the positions of the partial character strings in the registered document acquired in the partial character string acquisition step; and a determination step in which determination means of the information processing apparatus determines a registered document that approximates the query document according to the result calculated in the calculation step.

The present invention is also a program that can be read and executed by an information processing apparatus that includes storage means for storing partial character strings, obtained by decomposing text contained in a registered document, and the position of each partial character string in the registered document, the registered document being a document to be compared for approximation with a query document that is a designated document, and that determines a registered document approximating the designated query document. The program causes the information processing apparatus to function as: partial character string acquisition means for acquiring partial character strings that are contained in a registered document stored in the storage means and are the same as partial character strings obtained by decomposing the query document; calculation means for calculating the degree of approximation between the registered document and the query document using the number of partial character strings for which the partial character string in the query document and the partial character string in the registered document have the same positional relationship, as determined from the positions of the partial character strings in the query document and the positions of the partial character strings in the registered document acquired by the partial character string acquisition means; and determination means for determining a registered document that approximates the query document according to the result calculated by the calculation means.

According to the present invention, it is possible to provide an approximate document search apparatus, a control method therefor, and a program that can search for approximate documents at high speed. In particular, because components effective for discriminating among registered documents are selected from the components of the query document on the basis of statistics over the entire registered document set, the invention is also effective for searches in which the query document contains only part of a registered document.

A diagram showing the configuration of the approximate document search apparatus in the embodiment of the present invention.
A diagram showing the hardware configuration of various terminals in the embodiment of the present invention.
A diagram showing the basic processing flow of the document registration unit of the approximate document search apparatus in the embodiment of the present invention.
A diagram showing the configuration of the registered document storage area in the embodiment of the present invention.
A diagram showing an example of the bibliographic information table in the embodiment of the present invention.
A diagram showing an example of the inverted index in the embodiment of the present invention.
A diagram showing an example of the document registration process in the embodiment of the present invention.
A diagram showing the basic processing flow of the approximate search processing unit in the embodiment of the present invention.
A diagram showing the basic processing flow of the statistical information acquisition process of the approximate search processing unit in the embodiment of the present invention.
A diagram showing an example of a formula for calculating the appearance probability of a concatenated character string with respect to the registered document set in the embodiment of the present invention.
A diagram showing an example of a formula for calculating the information amount of a concatenated character string with respect to the registered document set in the embodiment of the present invention.
A diagram showing an example of a formula for calculating the information entropy of specific characters with respect to the registered document set in the embodiment of the present invention.
A diagram showing an example of obtaining statistics from the information contained in the inverted index in the embodiment of the present invention.
A diagram showing the basic processing flow of the evaluation concatenated character string selection process of the approximate search processing unit in the embodiment of the present invention.
A diagram showing an example of the evaluation concatenated character string selection process in the embodiment of the present invention.
A diagram showing an example of correcting position information and aggregating by corrected appearance position in the evaluation concatenated character string selection process in the embodiment of the present invention.
A diagram showing the basic processing flow of the approximation degree calculation process of the approximate search processing unit in the embodiment of the present invention.
A diagram showing an example of a formula for calculating the degree of approximation in the embodiment of the present invention.
A diagram showing an example of the approximation degree calculation process in the embodiment of the present invention.
A diagram showing the basic processing flow of the approximation degree calculation process of the approximate search processing unit in the second embodiment of the present invention.
A diagram showing an example of a formula for calculating the degree of approximation in the second embodiment of the present invention.
A diagram showing an example of the approximation degree calculation process in the second embodiment of the present invention.
A diagram showing the flow of the process of specifying an allowable interval in the second embodiment of the present invention.
A diagram showing the basic processing flow of the approximation degree calculation process of the approximate search processing unit in the third embodiment of the present invention.
A diagram showing an example of the approximation degree calculation process in the third embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing the configuration of an approximate document search apparatus according to an embodiment of the present invention.
The configuration of the approximate document search apparatus in FIG. 1 is only an example; various configurations are possible depending on the application and purpose.
Reference numeral 100 denotes the approximate document search apparatus. The approximate document search apparatus 100 comprises a document registration unit 101, a registered document information storage area 102, and an approximate document search unit 103.

The document registration unit 101 decomposes the character string constituting a registered document 110 input to the approximate document search apparatus 100 into partial character strings, and adds all the partial character strings and their appearance positions, together with the bibliographic information of the registered document, to the registered document information storage area 102. That is, the partial character strings are obtained by decomposing the text contained in the registered document.

The approximate document search unit 103 decomposes the character string constituting a query document 111 input to the approximate document search apparatus 100 into partial character strings. Referring to the registered document information storage area 102, it obtains, for every partial character string, statistical information over the registered documents (including the information amount and information entropy described later) and appearance position information in the registered document set. Based on the obtained statistical information and appearance position information, it finds documents that approximate the query document 111 in the document set registered in the registered document information storage area 102, and outputs a list of the approximating documents, according to their degree of approximation, as an approximate search result 112.

次に、図１の近似文書検索装置１００のハードウェア構成について、図２を用いて説明する。 Next, the hardware configuration of the approximate document search apparatus 100 in FIG. 1 will be described with reference to FIG.

図２は、本発明の実施形態における各種端末のハードウェア構成を示す図である。 FIG. 2 is a diagram illustrating a hardware configuration of various terminals according to the embodiment of the present invention.

ＣＰＵ２０１は、システムバス２０４に接続される各デバイスやコントローラを統括的に制御する。 The CPU 201 comprehensively controls each device and controller connected to the system bus 204.

The ROM 202 or the external memory 211 stores a BIOS (Basic Input/Output System), which is a control program for the CPU 201, an operating system program (hereinafter, OS), and the various programs, described later, that are needed to implement the functions executed by each server or PC. The RAM 203 functions as the main memory, work area, and the like of the CPU 201.

ＣＰＵ２０１は、処理の実行に際して必要なプログラム等をＲＡＭ２０３にロードして、プログラムを実行することで各種動作を実現するものである。 The CPU 201 implements various operations by loading a program necessary for execution of processing into the RAM 203 and executing the program.

The input controller (input C) 205 controls input from a keyboard 209 and from a pointing device such as a mouse (not shown).

The video controller (VC) 206 controls display on a display device such as a CRT display (CRT) 210. The display device may be a liquid crystal display instead of a CRT. These are used by an administrator as needed and are not directly related to the present invention.

The memory controller (MC) 207 controls access to the external memory 211, such as a hard disk (HD) or floppy (registered trademark) disk (FD) that stores a boot program, browser software, various applications, font data, user files, edit files, various data, and the like, or a CompactFlash memory connected to a PCMCIA card slot via an adapter.

通信Ｉ／Ｆコントローラ（通信Ｉ／ＦＣ）２０８は、ネットワークを介して、外部機器と接続・通信するものであり、ネットワークでの通信制御処理を実行する。例えば、ＴＣＰ／ＩＰを用いたインターネット通信等が可能である。 A communication I / F controller (communication I / FC) 208 is connected to and communicates with an external device via a network, and executes communication control processing in the network. For example, Internet communication using TCP / IP is possible.

なお、ＣＰＵ２０１は、例えばＲＡＭ２０３内の表示情報用領域へアウトラインフォントの展開（ラスタライズ）処理を実行することにより、ＣＲＴ２１０上での表示を可能としている。また、ＣＰＵ２０１は、ＣＲＴ２１０上の不図示のマウスカーソル等でのユーザ指示を可能とする。 Note that the CPU 201 enables display on the CRT 210 by executing outline font rasterization processing on a display information area in the RAM 203, for example. In addition, the CPU 201 enables a user instruction with a mouse cursor (not shown) on the CRT 210.

The approximate document search program for implementing the present invention is recorded in the external memory 211 and is executed by the CPU 201 after being loaded into the RAM 203 as necessary. The definition file 213 and the various information tables 214 used by the program according to the present invention are also stored in the external memory 211; details of these will be described later.

(Document registration process)
Next, the basic processing flow of the document registration unit 101 in the approximate document search apparatus 100 will be described with reference to FIG. 3.

図３は文書登録部の文書登録処理における処理フローを示す図である。 FIG. 3 is a diagram showing a processing flow in the document registration processing of the document registration unit.

In step S301, the document registration unit 101 registers the bibliographic information of the document to be registered in the bibliographic information table 401. At this time, it obtains a document ID as an identifier that uniquely identifies the registered document.

In step S302, the document registration unit 101 divides the character string constituting the registered document into two-character concatenated strings and, for each two-character concatenated string, generates position information that combines the document ID, which is the identifier of the document, with the appearance position of the two-character concatenated string.
In step S303, the document registration unit 101 starts iterating over the two-character concatenated strings obtained in step S302.

In step S304, the document registration unit 101 adds the two-character concatenated string and its position information to the inverted index 402. At the same time, it increments by 1 the appearance frequency of that two-character concatenated string stored in the inverted index 402.

FIG. 4 is a conceptual diagram of the registered document information storage area 102. The registered document information storage area 102 contains the bibliographic information table 401 and the inverted index 402.

In step S305, if two-character concatenated strings remain to be processed, the document registration unit 101 returns the process to step S303; if none remain, it ends the process.

(Specific example of this process)
Next, the case of registering a document "大阪の天気７月１０日．ｄｏｃ" ("Osaka weather July 10.doc") consisting of the single sentence "本日は晴天なれど大阪湾の波高し。" ("Today is sunny, but the waves in Osaka Bay are high.") will be described concretely.

In step S301, the document registration unit 101 registers "大阪の天気７月１０日．ｄｏｃ" as the bibliographic information of the document in the bibliographic information table 401 shown in FIG. 5. At this time, it obtains the document ID "15" assigned as the identifier that uniquely identifies the registered document.

In step S302, the document registration unit 101 divides the character string "本日は晴天なれど大阪湾の波高し。" constituting the registered document into two-character concatenated strings and, for each two-character concatenated string, generates position information that combines the document ID with the appearance position of the two-character concatenated string. For example, for the first two-character concatenated string "本日", position information "15:1" is generated from the document ID "15" and the appearance position "1"; for the next two-character concatenated string "日は", position information "15:2" is generated from the document ID "15" and the appearance position "2".

In step S303, the document registration unit 101 starts iterating from the first two-character concatenated string "本日".

In step S304, the document registration unit 101 adds the two-character concatenated string "本日" and the position information "15:1" to the inverted index 402 shown in FIG. 6. In the inverted index 402, an appearance frequency of "17" and position information "... 13:14 14:32" are already registered for "本日", so the position information "15:1" is appended to give "... 13:14 14:32 15:1", and the appearance frequency is incremented by 1 to "18".

In step S305, because the next two-character concatenated string "日は" remains, the document registration unit 101 returns the process to step S303.

When the process is repeated in the same way for all two-character concatenated strings, the appearance positions of any two-character concatenated string that appears in "大阪の天気７月１０日．ｄｏｃ" can be obtained from the inverted index 402, as shown in FIG. 7.
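
The registration steps S302 to S305 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name InvertedIndex, the function bigrams_with_positions, and the in-memory dictionaries are hypothetical choices made only for illustration. The 1-based positions and the "document ID:position" format follow the worked example above.

```python
from collections import defaultdict


def bigrams_with_positions(text):
    """Return (two-character string, position) pairs; positions are 1-based."""
    return [(text[i:i + 2], i + 1) for i in range(len(text) - 1)]


class InvertedIndex:
    """Hypothetical in-memory stand-in for the index 402."""

    def __init__(self):
        self.postings = defaultdict(list)  # two-character string -> ["15:1", ...]
        self.frequency = defaultdict(int)  # two-character string -> appearance count

    def register(self, doc_id, text):
        # S303 to S305: loop over every two-character string of the document.
        for gram, pos in bigrams_with_positions(text):
            # S304: append the position information and increment the frequency.
            self.postings[gram].append(f"{doc_id}:{pos}")
            self.frequency[gram] += 1


index = InvertedIndex()
index.register(15, "本日は晴天なれど大阪湾の波高し。")
print(index.postings["本日"])  # ['15:1']
print(index.postings["日は"])  # ['15:2']
```

Starting from an empty index, the frequency of "本日" is 1 here rather than the "18" of the example, because the example assumes 17 earlier occurrences in other registered documents.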

(Approximate document search process)
Next, the basic processing flow of the approximate document search unit 103 in the approximate document search apparatus 100 will be described with reference to FIG. 8.

図８は近似文書検索部１０３の近似文書検索処理における概略フローを示す図である。 FIG. 8 is a diagram showing a schematic flow in the approximate document search process of the approximate document search unit 103.

In step S801, the approximate document search unit 103 obtains statistics, over the registered document set, for the two-character concatenated strings constituting the query document and for the query document as a whole.

In step S802, the approximate document search unit 103 judges how much each two-character concatenated string contributes to document discrimination, based on the statistical information for the two-character concatenated strings and for the query document as a whole obtained in S801 and on a predetermined selection criterion, and selects those with a high contribution as evaluation concatenated character strings. For each evaluation concatenated character string, it obtains position information from the inverted index 402 and corrects it by the appearance position of the evaluation concatenated character string in the query document. The appearance position corrected in this way is called corrected position information.

ステップＳ８０３において、近似文書検索部１０３は、Ｓ８０２で取得した評価連接文字列の補正位置情報ごとに集約し、補正位置情報を含む文書ごとに近似度合を算出し、算出した近似度合に基づき、近似する文書を特定する。 In step S803, the approximate document search unit 103 aggregates the evaluation concatenated character strings acquired in S802 by corrected position information, calculates a degree of approximation for each document that contains corrected position information, and identifies approximate documents based on the calculated degree.

本実施の形態においては、文書弁別の寄与度合として、連接文字列の情報量を用いる。また、選択基準として、問合せ文書を構成する文字集合の登録文書集合における乱雑さ度合を示す情報エントロピーを用いる。評価する連接文字列に対する情報量を大きい順に積算し、情報量の積算値が選択基準となる情報エントロピー（乱雑さ度合）を超えたときに判断に必要な情報が得られたと考える。 In the present embodiment, the information amount of a concatenated character string is used as its degree of contribution to document discrimination. As the selection criterion, the information entropy is used, which indicates the degree of randomness, in the registered document set, of the set of characters making up the query document. The information amounts of the concatenated character strings to be evaluated are accumulated in descending order, and the information needed for the judgment is considered to have been obtained once the accumulated value exceeds the information entropy (degree of randomness) serving as the selection criterion.

このように登録文書集合から文書の弁別に寄与する度合が小さい連接文字列の評価を行わないことで弁別効果の低い構成要素に対する処理を削減できる。 By not evaluating concatenated character strings that contribute little to discriminating documents in the registered document set, processing of components with a low discrimination effect can be reduced.

ステップＳ８０１〜ステップＳ８０３の詳細については後述する。 Details of steps S801 to S803 will be described later.

（統計情報取得処理） (Statistical information acquisition processing)
次に、実施例の近似文書検索処理におけるステップＳ８０１の詳細なフローについて図９を用いて説明する。 Next, a detailed flow of step S801 in the approximate document search process of the embodiment will be described with reference to FIG. 9.

図９は近似文書検索処理における統計情報取得処理の詳細なフローを示す図である。 FIG. 9 is a diagram showing a detailed flow of statistical information acquisition processing in approximate document search processing.

ステップＳ９０１において、近似文書検索部１０３は、問合せ文書を構成する文字列を２連接文字列に分解する。 In step S901, the approximate document search unit 103 decomposes the character string that forms the query document into two-character strings.

ステップＳ９０２において、近似文書検索部１０３は、一時領域に保存する問合せ文書の統計情報の基準値を０にセットする。 In step S902, the approximate document search unit 103 sets the reference value of the statistical information of the query document stored in the temporary area to 0.

ステップＳ９０３において、近似文書検索部１０３は、ステップＳ９０１で分割した２連接文字列について繰り返し処理を開始する。 In step S903, the approximate document search unit 103 starts the iterative process for the two concatenated character strings divided in step S901.

ステップＳ９０４において、近似文書検索部１０３は、登録されている全ての連接文字列の総数と、処理中の２連接文字列に対する出現頻度を転置インデックス４０２から取得して、図１０に示す式を用いて２連接文字列の登録文書集合における出現確率Ｐ（Ｗ）を算出し、さらに図１１に示す式を用いて情報量Ｉ（Ｗ）を算出する。また、２連接文字列の先頭文字に対し、図１２に示す式を用いて情報エントロピーＥ（Ｃ）を算出し、２連接文字列が末尾である場合、全ての文字に対して情報エントロピーを算出する。 In step S904, the approximate document search unit 103 acquires, from the transposed index 402, the total number of all registered concatenated character strings and the appearance frequency of the two-character string being processed, calculates the appearance probability P(W) of the two-character string in the registered document set using the formula shown in FIG. 10, and then calculates the information amount I(W) using the formula shown in FIG. 11. In addition, the information entropy E(C) is calculated for the first character of the two-character string using the formula shown in FIG. 12; when the two-character string is the last one, the information entropy is calculated for all characters of that string.
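
The formulas of FIGS. 10 to 12 are drawings and are not reproduced in this text. The following is a hedged reconstruction, not the figures themselves. The first two forms agree with the worked values given later in the description (P(本日) = 18/5000 = 0.0036 and I(本日) = 8.1178). The entropy form is an assumption: it is the standard Shannon entropy over the registered two-character strings that begin with C, which the text does not confirm. The symbols f(W) (appearance frequency of W in the transposed index 402), N (total number of registered concatenated character strings), S(C) (the set of registered two-character strings whose first character is C), and F(C) are introduced here for illustration only.

```latex
P(W) = \frac{f(W)}{N}, \qquad I(W) = -\log_2 P(W)

% Assumed form of the FIG. 12 formula (not confirmed by the text):
E(C) = -\sum_{W \in S(C)} \frac{f(W)}{F(C)} \log_2 \frac{f(W)}{F(C)},
\qquad F(C) = \sum_{W \in S(C)} f(W)
```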

ステップＳ９０５において、近似文書検索部１０３は、ステップＳ９０４で取得した情報エントロピーを一時領域にある基準値に加算する。 In step S905, the approximate document search unit 103 adds the information entropy acquired in step S904 to the reference value in the temporary area.

ステップＳ９０６において、近似文書検索部１０３は、まだ処理すべき２連接文字列があれば、処理をステップＳ９０３に戻す。処理すべき２連接文字列がなければ処理を終了する。 In step S906, the approximate document search unit 103 returns the process to step S903 if there is still a two-character string to be processed. If there is no two concatenated character string to be processed, the process is terminated.
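
Steps S901 to S906 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the transposed index 402 is reduced to a hypothetical dictionary `index_freq` mapping each two-character string to its appearance frequency, the information amount is taken as -log2 of the appearance probability (which matches the worked value 8.1178 for 18/5000), and the entropy function assumes a Shannon entropy over registered strings sharing a first character, since FIG. 12 is not reproduced in the text. All function names are illustrative.

```python
import math


def bigrams(text):
    """Split text into overlapping two-character strings (step S901)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def information(freq, total):
    """Information amount -log2(freq / total); assumed form of FIG. 11."""
    return -math.log2(freq / total)


def leading_char_entropy(char, index_freq):
    """ASSUMPTION: Shannon entropy over registered strings starting with char."""
    freqs = [f for w, f in index_freq.items() if w[0] == char]
    total = sum(freqs)
    if total == 0:
        return 0.0
    return -sum((f / total) * math.log2(f / total) for f in freqs)


def acquire_statistics(query, index_freq, total):
    """Return per-string statistics and the selection reference value."""
    reference = 0.0                                  # S902: reset reference value
    stats = []
    grams = bigrams(query)
    for pos, w in enumerate(grams, start=1):         # S903: loop over strings
        freq = index_freq.get(w, 0)
        # S904: information amount; None when the string is not registered
        info = information(freq, total) if freq else None
        stats.append((w, pos, freq, info))
        reference += leading_char_entropy(w[0], index_freq)   # S904-S905
    if grams:
        # For the last string, the trailing character is counted as well
        # (one reading of "calculated for all characters").
        reference += leading_char_entropy(grams[-1][1], index_freq)
    return stats, reference
```

Positions are 1-based here so that they line up with the query positions used in the worked example (for instance, position 9 for “阪湾”).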

（評価連接文字列選択処理） (Evaluation concatenated character string selection process)
次に、実施例の近似文書検索処理におけるステップＳ８０２の処理の詳細なフローについて図１４を用いて説明する。 Next, the detailed flow of the process of step S802 in the approximate document search process of the embodiment will be described with reference to FIG. 14.

図１４は近似文書検索処理における評価連接文字列選択処理の詳細なフローを示す図である。 FIG. 14 is a diagram showing a detailed flow of the evaluation concatenated character string selection process in the approximate document search process.

ステップＳ１４０１において、近似文書検索部１０３は、出現位置バッファを格納する領域をメモリー上に確保して初期化する。 In step S1401, the approximate document search unit 103 secures and initializes an area for storing the appearance position buffer in the memory.

ステップＳ１４０２において、近似文書検索部１０３は、ステップＳ８０１で取得した、問合せ文書を構成する連接文字列と統計情報の集合を、文書弁別の寄与度合（連接文字列の情報量）の大きい順に並べ替える。 In step S1402, the approximate document search unit 103 rearranges the concatenated character string and statistical information set, which are acquired in step S801, constituting the query document in descending order of document discrimination contribution degree (concatenated character string information amount). .

ステップＳ１４０３において、近似文書検索部１０３は、並べ替えた連接文字列について繰り返し処理を開始する。 In step S1403, the approximate document search unit 103 starts an iterative process for the rearranged connected character strings.

ステップＳ１４０４において、近似文書検索部１０３は、転置インデックス４０２から連接文字列の登録文書集合における出現位置を取得する。 In step S 1404, the approximate document search unit 103 acquires the appearance position of the connected character string in the registered document set from the transposed index 402.

ステップＳ１４０５において、近似文書検索部１０３は、取得した全ての出現位置から連接文字列が問合せ文書に出現した位置を引いた値を補正出現位置として求める。 In step S1405, the approximate document search unit 103 obtains corrected appearance positions by subtracting, from each of the acquired appearance positions, the position at which the concatenated character string appears in the query document.

補正出現位置は、登録文書において問合せ文書が出現したと仮定した場合の問合せ文書の登録文書上での先頭位置を求めることに等しく、同じ補正出現位置をもつ連接文字列は、登録文書上で同じ問合せ文書を構成している可能性があることを示しており、同じ補正出現位置を持つ連接文字列が多いほど問合せ文書を構成している可能性が高くなる。 Computing the corrected appearance position is equivalent to finding where the query document would begin in the registered document if the query document appeared there. Concatenated character strings that share a corrected appearance position may therefore belong to the same occurrence of the query document in the registered document, and the more strings share a corrected appearance position, the more likely it is that they do.

ステップＳ１４０６において、近似文書検索部１０３は、ステップＳ１４０５で取得した補正出現位置を出現位置バッファに追加する。出現位置バッファに同じ補正出現位置を持つ出現位置情報が登録されていない場合、一致数を１として問合せ文書における出現位置とともに出現位置情報として登録する。出現位置バッファに同一の補正出現位置を持つ出現位置情報が既に登録されている場合、登録されている出現位置情報の一致数を１加算し、連接文字列の問合せ文書における出現位置が、登録済みの問合せ文書における出現位置より小さければ、登録済みの出現位置情報の問合せ文書における出現位置を処理中の連接文字列の問合せ文書における出現位置で更新する。 In step S1406, the approximate document search unit 103 adds the corrected appearance position acquired in step S1405 to the appearance position buffer. When the appearance position information having the same corrected appearance position is not registered in the appearance position buffer, it is registered as the appearance position information together with the appearance position in the inquiry document with a match number of 1. When appearance position information having the same corrected appearance position is already registered in the appearance position buffer, the number of matches of the registered appearance position information is incremented by 1, and the appearance position in the query document of the concatenated character string is already registered If it is smaller than the appearance position in the query document, the appearance position in the query document of the registered appearance position information is updated with the appearance position in the query document of the connected character string being processed.

ステップＳ１４０７において、近似文書検索部１０３は、処理済みの連接文字列の情報量の積算値がステップＳ８０１で取得した選択基準となる情報エントロピーを超えているか否かを判定する。情報量の積算値が情報エントロピーを超えていない場合は処理をステップＳ１４０８に移す。情報量の積算値が情報エントロピーを超えている場合は処理を終了する。 In step S1407, the approximate document search unit 103 determines whether the integrated value of the information amount of the processed connected character string exceeds the information entropy that is the selection criterion acquired in step S801. If the integrated value of the information amount does not exceed the information entropy, the process proceeds to step S1408. If the integrated value of the information amount exceeds the information entropy, the process ends.

ステップＳ１４０８において、近似文書検索部１０３は、まだ処理すべき連接文字列があれば、処理をステップＳ１４０３に戻す。処理すべき連接文字列がなければ処理を終了する。 In step S1408, the approximate document search unit 103 returns the process to step S1403 if there is a connected character string to be processed. If there is no connected character string to be processed, the process ends.
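
The selection loop of steps S1401 to S1408 can be sketched as below. This is an illustrative sketch under assumptions rather than the embodiment's code: `postings` is a hypothetical dictionary standing in for the transposed index 402 (string to a list of (document ID, position) pairs), `stats` holds (string, query position, information amount) tuples, and the appearance position buffer is modelled as a dictionary keyed by (document ID, corrected position).

```python
def select_evaluation_strings(stats, postings, threshold):
    """Build the appearance position buffer from the most informative strings.

    Returns a dict: (doc_id, corrected_pos) -> [smallest query position, match count].
    """
    buffer = {}                                        # S1401: empty buffer
    accumulated = 0.0
    # S1402-S1403: process strings in descending order of information amount
    for w, qpos, info in sorted(stats, key=lambda s: s[2], reverse=True):
        for doc_id, pos in postings.get(w, []):        # S1404: look up positions
            key = (doc_id, pos - qpos)                 # S1405: corrected position
            if key not in buffer:                      # S1406: register new entry
                buffer[key] = [qpos, 1]
            else:                                      # S1406: update existing entry
                entry = buffer[key]
                entry[1] += 1
                entry[0] = min(entry[0], qpos)
        accumulated += info
        if accumulated > threshold:                    # S1407: enough information
            break
    return buffer                                      # S1408: no strings left
```

Keying the buffer by (document ID, corrected position) makes the "already registered" check of step S1406 a constant-time lookup; the description does not state how the buffer is organised, so this is a design choice of the sketch.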

（近似度算出処理） (Approximation degree calculation process)
次に、実施例の近似文書検索処理におけるステップＳ８０３の処理の詳細なフローについて図１７を用いて説明する。 Next, a detailed flow of the process of step S803 in the approximate document search process of the embodiment will be described with reference to FIG. 17.

図１７は近似文書検索処理における近似度算出処理の詳細なフローを示す図である。 FIG. 17 is a diagram showing a detailed flow of the approximation calculation process in the approximate document search process.

ステップＳ１７０１において、近似文書検索部１０３は、結果バッファの領域をメモリー上に確保して初期化する。 In step S1701, the approximate document search unit 103 secures and initializes the result buffer area in the memory.

ステップＳ１７０２において、近似文書検索部１０３は、文書情報バッファの領域をメモリー上に確保して初期化する。 In step S1702, the approximate document search unit 103 secures and initializes a document information buffer area in the memory.

ステップＳ１７０３において、近似文書検索部１０３は、ステップＳ８０２において出現位置バッファに格納された出現位置情報について繰り返し処理を開始する。 In step S1703, the approximate document search unit 103 starts the iterative process for the appearance position information stored in the appearance position buffer in step S802.

ステップＳ１７０４において、近似文書検索部１０３は、出現位置情報の補正出現位置の一致数が規定値以上であるか否かを判定する。一致数が規定値以上である場合は処理をステップＳ１７０５に移す。一致数が規定値未満である場合は処理をステップＳ１７０６に移す。 In step S1704, the approximate document search unit 103 determines whether the number of matches of the corrected appearance positions of the appearance position information is equal to or greater than a specified value. If the number of matches is greater than or equal to the specified value, the process proceeds to step S1705. If the number of matches is less than the specified value, the process proceeds to step S1706.

ステップＳ１７０５において、近似文書検索部１０３は、出現位置情報を文書情報バッファに追加する。 In step S1705, the approximate document search unit 103 adds appearance position information to the document information buffer.

ステップＳ１７０６において、近似文書検索部１０３は、文書情報バッファが空であれば処理をステップＳ１７１２に移す。文書情報バッファが空でなければ処理をステップＳ１７０７に移す。 In step S1706, if the document information buffer is empty, the approximate document search unit 103 moves the process to step S1712. If the document information buffer is not empty, the process proceeds to step S1707.

ステップＳ１７０７において、近似文書検索部１０３は、現在処理中の出現位置情報が処理順における末尾であるか、または次の出現位置情報と現在処理中の出現位置情報が異なる文書ＩＤを持つか否かを判定する。 In step S1707, the approximate document search unit 103 determines whether the appearance position information currently being processed is the end in the processing order, or whether the next appearance position information and the appearance position information currently being processed have different document IDs. Determine.

現在処理中の出現位置情報が末尾、または現在処理中の出現位置情報次の出現位置情報が異なる文書ＩＤを持つ場合、処理をステップＳ１７０８に移す。現在処理中の出現位置情報が末尾でなく、次の出現位置情報が同じ文書ＩＤを持つ場合は、処理をステップＳ１７１２に移す。 If the appearance position information currently being processed is at the end, or if the next occurrence position information has a different document ID, the process moves to step S1708. If the appearance position information currently being processed is not the end, and the next appearance position information has the same document ID, the process proceeds to step S1712.

ステップＳ１７０８において、近似文書検索部１０３は、文書情報バッファに登録された出現位置情報から図１８に示す式を用いて近似度を算出する。近似度の算出方法については一例であり他の算出方法を用いてもよい。 In step S1708, the approximate document search unit 103 calculates the degree of approximation using the expression shown in FIG. 18 from the appearance position information registered in the document information buffer. The calculation method of the degree of approximation is an example, and other calculation methods may be used.

ステップＳ１７０９において、近似文書検索部１０３は、求めた近似度が規定値以上であるか否かを判定する。近似度が規定値以上である場合、処理をステップＳ１７１０に移す。近似度が規定値未満である場合、処理をステップＳ１７１１に移す。 In step S1709, the approximate document search unit 103 determines whether the obtained degree of approximation is equal to or greater than a specified value. If the degree of approximation is greater than or equal to the specified value, the process proceeds to step S1710. If the degree of approximation is less than the specified value, the process proceeds to step S1711.

ステップＳ１７１０において、近似文書検索部１０３は、文書ＩＤと近似度（算出結果）を紐づけて結果バッファに登録する。 In step S1710, the approximate document search unit 103 associates the document ID with the degree of approximation (calculation result) and registers it in the result buffer.

ステップＳ１７１１において、近似文書検索部１０３は、文書情報バッファを初期化する。 In step S1711, the approximate document search unit 103 initializes the document information buffer.

ステップＳ１７１２において、近似文書検索部１０３は、まだ処理すべき出現位置情報があれば、処理をステップＳ１７０３に戻す。処理すべき出現位置情報がなければ処理をステップＳ１７１３に移す。 In step S1712, if there is appearance position information that should still be processed, the approximate document search unit 103 returns the process to step S1703. If there is no appearance position information to be processed, the process proceeds to step S1713.

ステップＳ１７１３において、近似文書検索部１０３は、結果バッファを近似度の高い順に並び変える。 In step S1713, the approximate document search unit 103 rearranges the result buffers in descending order of approximation.

ステップＳ１７１４において、近似文書検索部１０３は、結果バッファの内容を近似検索結果１１２に格納して処理を終了する。 In step S1714, the approximate document search unit 103 stores the contents of the result buffer in the approximate search result 112 and ends the process.
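
The flow of steps S1701 to S1714 can be approximated by the sketch below. Several things are assumptions: the formula of FIG. 18 is not reproduced in the text, so the degree of approximation is supplied by the caller as a hypothetical `score` function; the appearance position buffer is assumed to be a dictionary keyed by (document ID, corrected position) with [query position, match count] values; and the per-entry loop with its "next entry has a different document ID" test is rewritten with `itertools.groupby`, which groups consecutive entries of the same document. The default thresholds 3 and 0.5 are the specified values used in the worked example.

```python
from itertools import groupby


def rank_documents(buffer, score, min_matches=3, min_score=0.5):
    """Return (doc_id, degree) pairs sorted by descending degree of approximation."""
    results = []                                           # S1701: result buffer
    entries = sorted(buffer.items())                       # order by document ID, then position
    for doc_id, group in groupby(entries, key=lambda e: e[0][0]):
        # S1702-S1705: keep entries whose match count reaches the specified value
        doc_entries = [e for e in group if e[1][1] >= min_matches]
        if not doc_entries:                                # S1706: nothing to score
            continue
        degree = score(doc_entries)                        # S1708: hypothetical formula
        if degree >= min_score:                            # S1709: threshold check
            results.append((doc_id, degree))               # S1710: register result
    results.sort(key=lambda r: r[1], reverse=True)         # S1713: sort by degree
    return results                                         # S1714: search result
```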

（ここでの処理の具体例） (Specific example of the processing)
次に「本日晴天なれど大阪湾の波高し。」という問合せ文書により近似文検索が行われた場合について具体的に説明する。 Next, the case where an approximate sentence search is performed with the query document “本日晴天なれど大阪湾の波高し。” (roughly, “The weather is fine today, but the waves in Osaka Bay are high.”) will be described in detail.

近似文書検索部１０３は、問合せ文書「本日晴天なれど大阪湾の波高し。」が入力されると、問合せ文書に対して、図９に示す統計情報取得処理を開始する。 When the query document “本日晴天なれど大阪湾の波高し。” is input, the approximate document search unit 103 starts the statistical information acquisition process shown in FIG. 9 for the query document.

ステップＳ９０１において、近似文書検索部１０３は、問合せ文書「本日晴天なれど大阪湾の波高し。」が入力されると、問合せ文書を構成する文字列を分解し、「本日」、「日晴」、「晴天」、「天な」、…、「し。」などの２連接文字列を取得する。 In step S901, when the query document “本日晴天なれど大阪湾の波高し。” is input, the approximate document search unit 103 decomposes the character string that forms the query document and acquires two-character strings such as “本日”, “日晴”, “晴天”, “天な”, ..., “し。”.

ステップＳ９０２において、近似文書検索部１０３は、一時領域に確保した問合せ文書の統計情報の基準値を０にセットする。 In step S902, the approximate document search unit 103 sets the reference value of the statistical information of the query document secured in the temporary area to zero.

ステップＳ９０３において、近似文書検索部１０３は、ステップＳ９０１で分割した最初の２連接文字列「本日」について処理を開始する。このとき転置インデックス４０２が図１３に示すような内容を格納しており、登録されている連接文字列の総数が「５０００」であったとする。 In step S903, the approximate document search unit 103 starts processing the first two-character string “本日” obtained in step S901. Assume here that the transposed index 402 stores the contents shown in FIG. 13 and that the total number of registered concatenated character strings is 5000.

ステップＳ９０４において、近似文書検索部１０３は、２連接文字列「本日」に対し、転置インデックス４０２から出現頻度「１８」を得る。連接文字列の総数は「５０００」であるので、図１０に示す式を用いて２連接文字列の全登録文書における「本日」の出現確率を求めるとＰ（本日）＝１８／５０００＝０．００３６を得る。さらに図１１に示す式を用いて２連接文字列「本日」が持つ情報量Ｉ（本日）＝８．１１７８を得る。 In step S904, the approximate document search unit 103 obtains the appearance frequency 18 for the two-character string “本日” from the transposed index 402. Since the total number of concatenated character strings is 5000, the formula shown in FIG. 10 gives the appearance probability of “本日” in all registered documents as P(本日) = 18/5000 = 0.0036. The formula shown in FIG. 11 then gives the information amount of “本日” as I(本日) = 8.1178.
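
The arithmetic of this step can be checked with a few lines of Python. The frequency 18 and the total 5000 come from the text; the use of a base-2 logarithm is an assumption about the FIG. 11 formula, chosen because it reproduces the stated value 8.1178. The variable names are illustrative.

```python
import math

total_strings = 5000   # total number of registered concatenated strings (from the text)
frequency = 18         # appearance frequency of "本日" (from the text)

probability = frequency / total_strings
information = -math.log2(probability)   # assumed form of the FIG. 11 formula

print(f"P = {probability:.4f}, I = {information:.4f}")
# prints: P = 0.0036, I = 8.1178
```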

また、図１３の転置インデックス４０２において「本」から始まる２連接文字列の出現頻度の総和は２８４であり、近似文書検索部１０３は図１２に示す式を用いて連接文字列の先頭文字「本」に対する情報エントロピーＥ（本）＝２．６２９２１７を取得する。 In the transposed index 402 of FIG. 13, the sum of the appearance frequencies of the two-character strings beginning with “本” is 284, and the approximate document search unit 103 uses the formula shown in FIG. 12 to obtain the information entropy E(本) = 2.629217 for the first character “本” of the concatenated character string.

ステップＳ９０５において、近似文書検索部１０３は、ステップＳ９０４で取得した「本」の情報エントロピーＥ（本）＝２．６２９２１７を一時領域に保存した選択の基準値に加算する。 In step S905, the approximate document search unit 103 adds the information entropy E(本) = 2.629217 of “本” acquired in step S904 to the selection reference value stored in the temporary area.

ステップＳ９０６において、近似文書検索部１０３は、次に処理すべき２連接文字列「日晴」があるので、処理をステップＳ９０３に戻す。 In step S906, the approximate document search unit 103 returns the process to step S903 because the next two-character string “日晴” remains to be processed.

近似文書検索部１０３は、同様に処理を進め、末尾の２連接文字列「し。」まで処理を繰り返すと、問合せ文書を構成する全ての２連接文字列に対して、図１５に示す出現頻度、出現確率、情報量の一覧を取得する。また選択の基準値として、文字の情報エントロピーの合計値である４２．３３７１を得て、図９の統計情報取得処理を終了する。 The approximate document search unit 103 proceeds in the same way, and once it has repeated the process through the last two-character string “し。”, it has obtained the list of appearance frequencies, appearance probabilities, and information amounts shown in FIG. 15 for all the two-character strings that make up the query document. It also obtains 42.3371, the sum of the information entropies of the characters, as the selection reference value, and ends the statistical information acquisition process of FIG. 9.

次に、近似文書検索部１０３は、統計情報取得処理で取得した図１５に示す出現頻度、出現確率、情報量の一覧と選択基準値４２．３３７１に対して、図１４に示す評価連接文字列選択処理を実施する。 Next, the approximate document search unit 103 uses the evaluation concatenated character string shown in FIG. 14 for the appearance frequency, appearance probability, information amount list and selection reference value 42.3371 shown in FIG. 15 acquired in the statistical information acquisition process. Perform the selection process.

ステップＳ１４０１において、近似文書検索部１０３は、出現位置バッファをメモリー上に確保して空にする。 In step S1401, the approximate document search unit 103 secures the appearance position buffer in the memory and empties it.

ステップＳ１４０２において、近似文書検索部１０３は、図１５に示す出現頻度、出現確率、情報量の一覧を情報量の大きい順に並べ替える。 In step S1402, the approximate document search unit 103 rearranges the list of appearance frequencies, appearance probabilities, and information amounts shown in FIG. 15 in descending order of information amount.

ステップＳ１４０３において、近似文書検索部１０３は、並べ替えた結果、先頭となった連接文字列「阪湾」に対する処理を開始する。 In step S1403, the approximate document search unit 103 starts processing the concatenated character string “阪湾”, which came first as a result of the sort.

ステップＳ１４０４において、近似文書検索部１０３は、転置インデックス４０２から連接文字列「阪湾」の登録文書集合における出現位置として「…１３：２３ １４：４３ １５：１０」を取得する。 In step S1404, the approximate document search unit 103 acquires “... 13:23 14:43 15:10” from the transposed index 402 as the appearance positions of the concatenated character string “阪湾” in the registered document set.

ステップＳ１４０５において、近似文書検索部１０３は、図１６に示すように、取得した出現位置「…１３：２３ １４：４３ １５：１０」から連接文字列が問合せ文書に出現した位置「９」を引いて補正出現位置として「…１３：１４ １４：３４ １５：１」を取得する。 In step S1405, as shown in FIG. 16, the approximate document search unit 103 subtracts the position 9, at which the concatenated character string appears in the query document, from the acquired appearance positions “... 13:23 14:43 15:10” and obtains “... 13:14 14:34 15:1” as the corrected appearance positions.

ステップＳ１４０６において、近似文書検索部１０３は、ステップＳ１４０５で取得した補正出現位置「…１３：１４ １４：３４ １５：１」を出現位置バッファに追加する。出現位置「１３：１４」については、問合せ文書出現位置「９」，一致数「１」とともに出現位置情報として出現位置バッファに登録する。同様に出現位置「１４：３４」および「１５：１」に対しても問合せ文書出現位置「９」，一致数「１」とともに出現位置情報を登録する。 In step S1406, the approximate document search unit 103 adds the corrected appearance positions “... 13:14 14:34 15:1” acquired in step S1405 to the appearance position buffer. The appearance position “13:14” is registered in the appearance position buffer as appearance position information together with the query document appearance position 9 and the match count 1. Similarly, appearance position information is registered for “14:34” and “15:1”, each with the query document appearance position 9 and the match count 1.

ステップＳ１４０７において、近似文書検索部１０３は、処理済みの連接文字列の情報量の積算値は９．９６５８となり選択基準となる情報エントロピー「４２．３３７１」を超えないので処理をステップＳ１４０８に移す。 In step S1407, the approximate document search unit 103 moves the process to step S1408 because the integrated value of the information amount of the processed concatenated character string is 9.9658 and does not exceed the information entropy “42.3371” serving as the selection criterion.

ステップＳ１４０８において、近似文書検索部１０３は、次に処理すべき連接文字列「ど大」があるので処理をステップＳ１４０３に戻す。 In step S1408, the approximate document search unit 103 returns the process to step S1403 because the next concatenated character string “ど大” remains to be processed.

ステップＳ１４０３において、近似文書検索部１０３は、連接文字列「ど大」に対する処理を開始する。 In step S1403, the approximate document search unit 103 starts processing the concatenated character string “ど大”.

ステップＳ１４０４において、近似文書検索部１０３は、転置インデックス４０２から連接文字列「ど大」の登録文書集合における出現位置として「…１２：３３ １４：２５ １５：８」を取得する。 In step S1404, the approximate document search unit 103 acquires “... 12:33 14:25 15:8” from the transposed index 402 as the appearance positions of the concatenated character string “ど大” in the registered document set.

ステップＳ１４０５において、近似文書検索部１０３は、取得した出現位置「…１２：３３ １４：２５ １５：８」から連接文字列が問合せ文書に出現した位置「７」を引いて補正出現位置として「…１２：２６ １４：１８ １５：１」を取得する。 In step S1405, the approximate document search unit 103 subtracts the position 7, at which the concatenated character string appears in the query document, from the acquired appearance positions “... 12:33 14:25 15:8” and obtains “... 12:26 14:18 15:1” as the corrected appearance positions.

ステップＳ１４０６において、近似文書検索部１０３は、ステップＳ１４０５で取得した補正出現位置「…１２：２６ １４：１８ １５：１」を出現位置バッファに追加する。出現位置「１２：２６」および「１４：１８」については、問合せ文書出現位置「７」，一致数「１」とともに出現位置情報を出現位置バッファに登録する。出現位置「１５：１」に対しては、出現位置情報が既に登録されているので、登録済みの出現位置情報に対し、問合せ文書出現位置「９」をより小さい問合せ文書出現位置「７」に置き換え、一致数「１」に１を加算して「２」とする。 In step S1406, the approximate document search unit 103 adds the corrected appearance positions “... 12:26 14:18 15:1” acquired in step S1405 to the appearance position buffer. For the appearance positions “12:26” and “14:18”, appearance position information is registered in the appearance position buffer together with the query document appearance position 7 and the match count 1. For the appearance position “15:1”, appearance position information has already been registered, so in the registered entry the query document appearance position 9 is replaced with the smaller query document appearance position 7, and the match count 1 is incremented to 2.
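
The two passes described so far can be replayed in Python to confirm the state of the entry for “15:1”. The positions and query offsets are taken from the text; entries that the source elides with “…” are simply left out, and the dictionary layout of the buffer is an assumption made for illustration.

```python
# Appearance positions (document ID, position) as listed in the text;
# earlier entries elided with "…" in the source are not included.
postings = {
    "阪湾": [(13, 23), (14, 43), (15, 10)],
    "ど大": [(12, 33), (14, 25), (15, 8)],
}
query_pos = {"阪湾": 9, "ど大": 7}   # positions in the query document (from the text)

buffer = {}   # (doc_id, corrected_pos) -> (smallest query position, match count)
for w in ["阪湾", "ど大"]:           # processing order used in the text
    for doc_id, pos in postings[w]:
        key = (doc_id, pos - query_pos[w])             # corrected appearance position
        qpos, count = buffer.get(key, (query_pos[w], 0))
        buffer[key] = (min(qpos, query_pos[w]), count + 1)

for key in sorted(buffer):
    print(key, buffer[key])
# The entry for (15, 1) ends up as (7, 2): query position 7, match count 2.
```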

ステップＳ１４０７において、近似文書検索部１０３は、処理済みの連接文字列の情報量の積算値は９．９６５８＋９．１１７８＝１９．０８３６となり、選択基準となる情報エントロピー「４２．３３７１」を超えないので処理をステップＳ１４０８に移す。 In step S1407, the approximate document search unit 103 has an integrated value of the information amount of the processed concatenated character string of 9.9658 + 9.1178 = 19.0836, and does not exceed the information entropy “42.3371” serving as the selection criterion. The process moves to step S1408.

近似文書検索部１０３は、以下同様に連接文字列「波高」、「晴天」と処理し、「本日」まで処理すると処理済みの連接文字列の情報量の積算値は９．９６５８＋９．１１７８＋８．８２８３＋８．４８０４＋８．１１７８＝４４．５１００となり、選択基準となる情報エントロピー「４２．３３７１」を超えるので繰り返し処理を終了し、図１９に示す状態の出現位置バッファを取得し、図１４に示す評価連接文字列選択処理を終了する。 The approximate document search unit 103 then processes the concatenated character strings “波高” and “晴天” in the same way. Once “本日” has been processed, the accumulated information amount of the processed concatenated character strings is 9.9658 + 9.1178 + 8.8283 + 8.4804 + 8.1178 = 44.5100, which exceeds the information entropy 42.3371 serving as the selection criterion. The iterative process therefore ends, the appearance position buffer in the state shown in FIG. 19 is obtained, and the evaluation concatenated character string selection process shown in FIG. 14 ends.
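
The stopping condition can be checked numerically as follows. The five information amounts and the threshold are the values quoted in the text, in processing order; the variable names are illustrative. Because the quoted amounts are rounded to four decimals, their sum may differ from the stated 44.5100 in the last digit.

```python
from itertools import accumulate

# Information amounts in processing order, as quoted in the text.
amounts = [9.9658, 9.1178, 8.8283, 8.4804, 8.1178]
threshold = 42.3371   # selection reference value (information entropy)

running = list(accumulate(amounts))   # accumulated value after each string
# Number of strings processed when the accumulated value first exceeds the threshold.
stop = next(i for i, total in enumerate(running, start=1) if total > threshold)

print(stop)   # prints: 5
```

After four strings the accumulated value is still below the threshold (about 36.39), so the fifth string “本日” is the one that ends the loop.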

次に、近似文書検索部１０３は、図１９に示す状態の出現位置バッファに対し、図１７に示す近似度算出処理を実施する。 Next, the approximate document search unit 103 performs an approximation calculation process shown in FIG. 17 on the appearance position buffer in the state shown in FIG.

ステップＳ１７０１において、近似文書検索部１０３は、結果バッファの領域をメモリー上に確保して空にする。 In step S1701, the approximate document search unit 103 secures the result buffer area in the memory and empties it.

ステップＳ１７０２において、近似文書検索部１０３は、文書情報バッファの領域をメモリー上に確保して空にする。 In step S1702, the approximate document search unit 103 secures an area of the document information buffer in the memory and empties it.

ステップＳ１７０３において、近似文書検索部１０３は、出現位置バッファに格納されている先頭の出現位置情報「３：２５」について処理を開始する。 In step S 1703, the approximate document search unit 103 starts processing for the first appearance position information “3:25” stored in the appearance position buffer.

ステップＳ１７０４において、近似文書検索部１０３は、出現位置情報の補正出現位置「３：２５」の一致数「１」が規定値（所定値）「３」未満であるので処理をステップＳ１７０６に移す。 In step S1704, the approximate document search unit 103 moves the process to step S1706 because the number of matches “1” of the corrected appearance position “3:25” of the appearance position information is less than the specified value (predetermined value) “3”.

ステップＳ１７０６において、近似文書検索部１０３は、文書情報バッファが空なので処理をステップＳ１７１２に移す。 In step S1706, the approximate document search unit 103 moves the process to step S1712 because the document information buffer is empty.

ステップＳ１７１２において、近似文書検索部１０３は、次の出現位置情報「１２：２６」があるので処理をステップＳ１７０３に戻す。 In step S1712, the approximate document search unit 103 returns the process to step S1703 because there is the next appearance position information “12:26”.

近似文書検索部１０３は、出現位置情報「１２：２６」「１３：１４」「１４：１８」「１４：３４」「１５：０」に対しては同様の処理を繰り返す。 The approximate document search unit 103 repeats the same processing for the appearance position information “12:26”, “13:14”, “14:18”, “14:34”, and “15: 0”.

ステップＳ１７０３において、近似文書検索部１０３は、出現位置バッファに格納されている先頭の出現位置情報「１５：１」について処理を開始する。 In step S 1703, the approximate document search unit 103 starts processing for the first appearance position information “15: 1” stored in the appearance position buffer.

ステップＳ１７０４において、近似文書検索部１０３は、出現位置情報の補正出現位置「１５：１」の一致数「４」が規定値「３」以上であるので処理をステップＳ１７０５に移す。 In step S1704, the approximate document search unit 103 moves the process to step S1705 because the number of matches “4” of the corrected appearance position “15: 1” of the appearance position information is equal to or greater than the specified value “3”.

ステップＳ１７０５において、近似文書検索部１０３は、出現位置情報「１５：１」を文書情報バッファに追加する。 In step S1705, the approximate document search unit 103 adds the appearance position information “15: 1” to the document information buffer.

ステップＳ１７０６において、近似文書検索部１０３は、文書情報バッファに出現位置情報「１５：１」が登録されており、空ではないので処理をステップＳ１７０７に移す。 In step S1706, the approximate document search unit 103 moves the process to step S1707 because the appearance position information “15: 1” is registered in the document information buffer and is not empty.

ステップＳ１７０７において、近似文書検索部１０３は、現在処理中の出現位置情報「１５：１」の文書ＩＤは「１５」であり、次の出現位置情報「１６：５」の文書ＩＤは「１６」であり文書ＩＤが異なるので、ステップＳ１７０８に処理を移す。 In step S1707, the document ID of the appearance position information “15:1” currently being processed is 15 and the document ID of the next appearance position information “16:5” is 16; since the document IDs differ, the approximate document search unit 103 moves the process to step S1708.

ステップＳ１７０８において、近似文書検索部１０３は、文書情報バッファに登録された出現位置情報から図１８に示す式を用いて近似度を算出し、近似度０．７５を得る。 In step S1708, the approximate document search unit 103 calculates an approximation degree from the appearance position information registered in the document information buffer using the formula shown in FIG. 18, and obtains an approximation degree of 0.75.

ステップＳ１７０９において、近似文書検索部１０３は、求めた近似度０．７５が近似度の規定値０．５以上であるので、処理をステップＳ１７１０に移す。 In step S1709, the approximate document search unit 103 moves the process to step S1710 because the obtained degree of approximation 0.75 is equal to or greater than the specified value 0.5 of the degree of approximation.

ステップＳ１７１０において、近似文書検索部１０３は、文書ＩＤ「１５」と近似度０．７５を紐づけて結果バッファに登録する。 In step S1710, the approximate document search unit 103 associates the document ID “15” with the degree of approximation 0.75, and registers it in the result buffer.

ステップＳ１７１１において、近似文書検索部１０３は、文書情報バッファを空にする。 In step S1711, the approximate document search unit 103 empties the document information buffer.

ステップＳ１７１２において、近似文書検索部１０３は、次の出現位置情報「１６：５」があるので処理をステップＳ１７０３に戻す。 In step S1712, the approximate document search unit 103 returns the process to step S1703 because there is next appearance position information “16: 5”.

ステップＳ１７０３において、近似文書検索部１０３は、出現位置バッファに格納されている先頭の出現位置情報「１６：５」について処理を開始する。 In step S 1703, the approximate document search unit 103 starts processing for the first appearance position information “16: 5” stored in the appearance position buffer.

ステップＳ１７０４において、近似文書検索部１０３は、出現位置情報の補正出現位置「１６：５」の一致数「３」が規定値「３」以上であるので処理をステップＳ１７０５に移す。 In step S1704, the approximate document search unit 103 moves the process to step S1705 because the number of matches “3” of the corrected appearance position “16: 5” of the appearance position information is equal to or greater than the specified value “3”.

ステップＳ１７０５において、近似文書検索部１０３は、出現位置情報「１６：５」を文書情報バッファに追加する。 In step S1705, the approximate document search unit 103 adds the appearance position information “16: 5” to the document information buffer.

ステップＳ１７０６において、近似文書検索部１０３は、文書情報バッファに出現位置情報「１６：５」が登録されており、空ではないので処理をステップＳ１７０７に移す。 In step S1706, the approximate document search unit 103 moves the process to step S1707 because the appearance position information “16: 5” is registered in the document information buffer and is not empty.

ステップＳ１７０７において、近似文書検索部１０３は、現在処理中の出現位置情報「１６：５」の文書ＩＤは１６であり、次の出現位置情報「１９：３２」の文書ＩＤは１９であり文書ＩＤが異なるので、ステップＳ１７０８に処理を移す。 In step S1707, the document ID of the appearance position information “16:5” currently being processed is 16 and the document ID of the next appearance position information “19:32” is 19; since the document IDs differ, the approximate document search unit 103 moves the process to step S1708.

ステップＳ１７０８において、近似文書検索部１０３は、文書情報バッファに登録された出現位置情報から図１８に示す式を用いて近似度を算出し、近似度０．５を得る。 In step S1708, the approximate document search unit 103 calculates an approximation degree from the appearance position information registered in the document information buffer using the formula shown in FIG. 18, and obtains an approximation degree of 0.5.

ステップＳ１７０９において、近似文書検索部１０３は、求めた近似度０．５が近似度の規定値０．５以上であるので、処理をステップＳ１７１０に移す。 In step S1709, the approximate document search unit 103 moves the process to step S1710 because the obtained degree of approximation 0.5 is equal to or greater than the prescribed value 0.5 of the degree of approximation.

ステップＳ１７１０において、近似文書検索部１０３は、文書ＩＤ「１６」と近似度０．５を紐づけて結果バッファに登録する。 In step S1710, the approximate document search unit 103 associates the document ID “16” with the degree of approximation of 0.5 and registers it in the result buffer.

ステップＳ１７１２において、近似文書検索部１０３は、次の出現位置情報「１９：３２」があるので処理をステップＳ１７０３に戻す。 In step S1712, the approximate document search unit 103 returns the process to step S1703 because there is next appearance position information “19:32”.

近似文書検索部１０３は、出現位置情報「１９：３２」以降も処理を繰り返し、全ての出現位置情報を処理して、ステップＳ１７１３に処理を移す。 The approximate document search unit 103 repeats the process after the appearance position information “19:32”, processes all the appearance position information, and moves the process to step S1713.

ステップＳ１７１３において、近似文書検索部１０３は、結果バッファに登録された文書ＩＤ「１５」（近似度０．７５）と文書ＩＤ「１６」（近似度０．５）を近似度の高い順に並び変える。 In step S1713, the approximate document search unit 103 rearranges the document ID “15” (approximation 0.75) and document ID “16” (approximation 0.5) registered in the result buffer in descending order of approximation. .

ステップＳ１７１４において、近似文書検索部１０３は、結果バッファの内容を検索結果として文書ＩＤ「１５」（近似度０．７５）と文書ＩＤ「１６」（近似度０．５）を近似検索結果１１２に格納して処理を終了する。 In step S 1714, the approximate document search unit 103 sets the document ID “15” (approximation 0.75) and the document ID “16” (approximation 0.5) as the approximate search result 112 using the contents of the result buffer as the search result. Store and finish the process.
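
The threshold check and ordering in steps S1709, S1710, and S1713 can be sketched as follows. This is a minimal illustration rather than the implementation described in the specification: the function name `rank_results` and the list-of-pairs form of the result buffer are assumptions, and the degrees of approximation are taken as already computed because the formula of FIG. 18 is not reproduced in this text.

```python
def rank_results(scored, threshold):
    """Keep documents whose degree of approximation reaches the threshold
    (steps S1709 and S1710) and order them from highest to lowest (step S1713).

    `scored` is a hypothetical list of (document ID, degree of approximation)
    pairs; the degrees are assumed to have been computed beforehand.
    """
    result_buffer = [(doc_id, degree) for doc_id, degree in scored
                     if degree >= threshold]
    return sorted(result_buffer, key=lambda pair: pair[1], reverse=True)


# Documents 15 and 16 use the values from the example above.
# Document 99 (degree 0.25) is a made-up entry that falls below the threshold.
ranked = rank_results([(16, 0.5), (99, 0.25), (15, 0.75)], threshold=0.5)
print(ranked)
```

Because the comparison is "equal to or greater than", a document whose degree of approximation equals the prescribed value is retained, as document ID "16" is in the example.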

(Second Embodiment)
(Approximation degree calculation process that tolerates positional deviation)

Next, as an embodiment, an approximate document search apparatus will be described with reference to FIG. 20. This apparatus also judges a section to be a matching section when the positional relationship of the concatenated character strings selected in the query document deviates from the positional relationship of the corresponding concatenated character strings in the registered document, provided that the deviation is within an allowable value.

This embodiment differs from the first embodiment only in the approximation degree calculation process.

FIG. 20 shows the detailed flow of the approximation degree calculation process within the approximate document search process of the second embodiment. In this flow, a match is also recognized when the deviation between the positional relationship of the concatenated character strings selected in the query document and that of the corresponding concatenated character strings in the registered document is within the allowable value.

The processing from step S1701 to step S1703 is the same as in the approximation degree calculation process of the first embodiment (FIG. 17).

In step S2001, the approximate document search unit 103 identifies an allowable section starting from the current appearance position information. The allowable section is obtained as a set of appearance position information entries that follow one another with a deviation (a difference between corrected appearance positions) within a predefined allowable value. Specifically, it is obtained as the position information of the first and last entries judged to belong to the allowable section. At the same time, the sum of the match counts of the appearance position information included in the allowable section is obtained. Details of the allowable section identification process are described later.

In step S2002, the approximate document search unit 103 determines whether the sum of the match counts of the appearance position information included in the allowable section obtained in step S2001 is equal to or greater than a prescribed value. If the sum is equal to or greater than the prescribed value, the process proceeds to step S2003. If the sum is less than the prescribed value, the process proceeds to step S2004.

In step S2003, the approximate document search unit 103 adds all appearance position information included in the allowable section to the document information buffer.

In step S2004, the approximate document search unit 103 sets the appearance position information at the end of the allowable section as the appearance position information currently being processed.

The processing from step S1706 to step S1707 is the same as in the approximation degree calculation process of the first embodiment (FIG. 17).

In step S1708, the approximate document search unit 103 calculates the degree of approximation from the appearance position information registered in the document information buffer, using the formula shown in FIG. 21. This formula is only an example; the calculation may instead be configured to apply weighting according to the amount of deviation.

The processing from step S1709 to step S1714 is the same as in the approximation degree calculation process of the first embodiment (FIG. 17).
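
The loop formed by steps S2001 to S2004 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of FIG. 20: the names `Appearance`, `SectionFinder`, and `gather_sections` are hypothetical, the section finder is passed in as a callable because its details are described separately, and a dictionary with one list per document stands in for the single document information buffer that the flow fills and evaluates at each document boundary.

```python
from collections import defaultdict
from typing import Callable, NamedTuple


class Appearance(NamedTuple):
    doc_id: int         # ID of the registered document
    corrected_pos: int  # corrected appearance position within that document
    match_count: int    # number of matches recorded for this position


# A finder returns (head index, tail index, total match count) for the
# allowable section that starts at the given index. Hypothetical interface.
SectionFinder = Callable[[list, int], tuple]


def gather_sections(buffer: list, find_section: SectionFinder, min_total: int) -> dict:
    """Walk the appearance position buffer one allowable section at a time."""
    doc_buffers = defaultdict(list)  # stand-in for the document information buffer
    i = 0
    while i < len(buffer):
        head, tail, total = find_section(buffer, i)       # step S2001
        if total >= min_total:                            # step S2002
            # Step S2003: add every entry of the section.
            doc_buffers[buffer[head].doc_id].extend(buffer[head:tail + 1])
        i = tail + 1  # step S2004 moves to the section tail; then advance
    return dict(doc_buffers)


# Stand-in finder for demonstration: every entry forms its own section.
def single_entry(buffer, i):
    return i, i, buffer[i].match_count


# Made-up entries, not taken from any figure.
entries = [Appearance(7, 0, 3), Appearance(7, 1, 2), Appearance(8, 4, 5)]
print(gather_sections(entries, single_entry, min_total=3))
```

With the stand-in finder, each entry is checked on its own match count, so the entry with a count of 2 is dropped. A finder that merges nearby positions would instead let that entry contribute to the section total.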

(Allowable section identification process)
Next, details of the allowable section identification process in step S2001 will be described with reference to FIG. 23.

FIG. 23 shows the processing flow for identifying the allowable section in step S2001.

In step S2301, the approximate document search unit 103 sets the sum of the match counts for the entire allowable section to 0.

In step S2302, the approximate document search unit 103 sets the appearance position information given as the starting point as the head of the allowable section.

In step S2303, the approximate document search unit 103 starts iterative processing from the appearance position information given as the starting point.

In step S2304, the approximate document search unit 103 adds the match count of the appearance position information being processed to the sum of the match counts for the entire allowable section.

In step S2305, the approximate document search unit 103 sets the appearance position information being processed as the end of the allowable section.

In step S2306, the approximate document search unit 103 obtains, as the positional deviation, the difference between the corrected appearance position of the appearance position information being processed and that of the next appearance position information. If the two corrected appearance positions belong to different documents, a value that exceeds the allowable value, such as the largest value the computer can represent, is used as the positional deviation.

In step S2307, the approximate document search unit 103 determines whether the positional deviation is within the allowable value. If it is within the allowable value, the process proceeds to step S2308. If it exceeds the allowable value, the iteration is aborted and the allowable section identification process ends.

In step S2308, the approximate document search unit 103 sets the deviation count of the appearance position information to 1. Instead of the number of deviations, the amount of deviation may be retained, and the degree of approximation may be obtained by weighting according to that amount.

In step S2309, if appearance position information remains to be processed, the approximate document search unit 103 returns to step S2303. If none remains, the process ends.
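
Steps S2301 to S2309 can be sketched as follows. This is an illustrative reading of the flow, not the implementation of FIG. 23. The names `Appearance` and `find_allowable_section` are hypothetical, the buffer is assumed to be ordered by document ID and then by corrected appearance position, and the sketch checks for a next entry before computing the deviation, which the text leaves implicit. The bookkeeping of step S2308 is omitted.

```python
import sys
from typing import NamedTuple


class Appearance(NamedTuple):
    doc_id: int         # ID of the registered document
    corrected_pos: int  # corrected appearance position within that document
    match_count: int    # number of matches recorded for this position


def find_allowable_section(buffer, start, tolerance):
    """Return (head index, tail index, total match count) of the allowable
    section that begins at buffer[start]."""
    total = 0      # step S2301
    head = start   # step S2302
    tail = start
    i = start
    while True:    # step S2303
        current = buffer[i]
        total += current.match_count  # step S2304
        tail = i                      # step S2305
        if i + 1 >= len(buffer):      # nothing left to process (step S2309)
            break
        nxt = buffer[i + 1]
        # Step S2306: a document change counts as an oversized deviation.
        if nxt.doc_id != current.doc_id:
            deviation = sys.maxsize
        else:
            deviation = abs(nxt.corrected_pos - current.corrected_pos)
        if deviation > tolerance:     # step S2307
            break
        i += 1                        # continue with the next entry
    return head, tail, total


# Made-up entries, not taken from any figure.
buffer = [Appearance(7, 0, 3), Appearance(7, 1, 2),
          Appearance(7, 9, 4), Appearance(8, 10, 1)]
print(find_allowable_section(buffer, 0, tolerance=2))
```

Using `sys.maxsize` for a document change mirrors the suggestion in step S2306 to use a value that is certain to exceed the allowable value, so no separate branch is needed to end a section at a document boundary.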

(Specific example of the processing)
Next, a case will be described in which an approximate document search is performed with the same query document as in the specific example of the first embodiment, 「本日晴天なれど大阪湾の波高し。」 ("The weather is fine today, but the waves in Osaka Bay are high.").

As in the first embodiment, the approximate document search unit 103 executes steps S801 and S802 and obtains the appearance position buffer shown in FIG. 22.

The approximate document search unit 103 applies the approximation degree calculation process shown in FIG. 20 to the appearance position buffer in the state shown in FIG. 22.

The approximate document search unit 103 repeats the same processing for the appearance position information (corrected appearance positions) "3:25", "12:26", "13:14", "14:18", and "14:34".

In step S1703, the approximate document search unit 103 starts processing the appearance position information "15:0" stored in the appearance position buffer.

In step S2001, starting from the appearance position information "15:0", the approximate document search unit 103 obtains the allowable section and the sum of its match counts through the allowable section identification process shown in FIG. 23. It obtains "15:0" as the head of the allowable section, "15:1" as the end of the allowable section (the range of predetermined positions), and "5" as the sum of the match counts of the allowable section.

In step S2002, because the sum "5" of the match counts of the allowable section starting from the appearance position information "15:0" is equal to or greater than the prescribed value "3", the approximate document search unit 103 proceeds to step S2003.

In step S2003, the approximate document search unit 103 adds the appearance position information "15:0" and "15:1" included in the allowable section to the document information buffer.

In step S2004, the approximate document search unit 103 sets the appearance position information "15:1", which is the end of the allowable section, as the appearance position information being processed.

In step S1706, the document information buffer contains the appearance position information "15:0" and "15:1" and is therefore not empty, so the approximate document search unit 103 proceeds to step S1707.

In step S1708, the approximate document search unit 103 calculates the degree of approximation from the appearance position information registered in the document information buffer, using the formula shown in FIG. 21, and obtains a degree of approximation of 0.975.

In step S1709, because the calculated degree of approximation 0.975 is equal to or greater than the prescribed value 0.5, the approximate document search unit 103 proceeds to step S1710.

In step S1710, the approximate document search unit 103 associates the document ID "15" with the degree of approximation 0.975 and registers the pair in the result buffer.

In step S1710, the approximate document search unit 103 empties the document information buffer.

The same processing is then repeated, and the approximate document search unit 103 stores document ID "15" (degree of approximation 0.975), document ID "16" (degree of approximation 0.5), and document ID "22" (degree of approximation 0.725) in the approximate search result 112 as documents that approximate the query document, and ends the process.

Document ID "22", whose text is considered closer to the query document than that of document ID "16", could not be detected in the first embodiment. In this embodiment it is detected with a higher degree of approximation than document ID "16", which gives a result closer to human intuition.

(Third Embodiment)
(Exclusion of overlapping sections)
Next, an approximate document search apparatus will be described with reference to FIG. 24. When a single registered document contains multiple positional relationships of concatenated character strings that match the positional relationship of the concatenated character strings selected in the query document, this apparatus limits the matching location to only one of them.

FIG. 24 shows the detailed flow of the approximation degree calculation process within the approximate document search process of the third embodiment. In this flow, when a single registered document contains multiple positional relationships of concatenated character strings that match the positional relationship of the concatenated character strings selected in the query document, the matching location is limited to only one of them.

In step S2401, the approximate document search unit 103 determines whether appearance position information having the same query document appearance position as the appearance position information currently being processed is registered in the document information buffer. If such an entry is registered, the process proceeds to step S2402. If not, the process proceeds to step S1705.

In step S2402, the approximate document search unit 103 determines whether the match count of the appearance position information currently being processed is greater than the match count of the registered appearance position information that has the same query document appearance position. If the match count of the entry currently being processed is greater, the process proceeds to step S2403. Otherwise, the process proceeds to step S1706.

In step S2403, the approximate document search unit 103 deletes from the document information buffer the appearance position information that has the same query document appearance position as the entry currently being processed, and registers the entry currently being processed in the document information buffer.

The processing from step S1705 to step S1714 is the same as in the approximation degree calculation process of the first embodiment (FIG. 17).

As in the first embodiment, the approximate document search unit 103 executes steps S801 and S802 and obtains the appearance position buffer shown in FIG. 25.

For the appearance position information "3:25" through "25:5", the approximate document search unit 103 repeats the same processing as in the first embodiment.

In step S1703, the approximate document search unit 103 starts processing the appearance position information "39:29" stored in the appearance position buffer.

In step S1704, because the match count "3" of the corrected appearance position "39:29" is equal to or greater than the prescribed value "3", the approximate document search unit 103 proceeds to step S2401.

In step S2401, the approximate document search unit 103 determines whether the document information buffer contains appearance position information having the query document appearance position "17", which is that of the corrected appearance position "39:29".

Because the document information buffer contains no appearance position information having the query document appearance position "17", the process proceeds to step S1705.

In step S1705, the approximate document search unit 103 adds the appearance position information "39:29" to the document information buffer together with the query document appearance position "17".

In step S1706, the document information buffer contains the appearance position information "39:29" and is therefore not empty, so the approximate document search unit 103 proceeds to step S1707.

In step S1707, the document ID of the appearance position information "39:29" currently being processed is "39", and the document ID of the next appearance position information "39:49" is also "39". Because the document IDs are equal, the approximate document search unit 103 proceeds to step S1712.

In step S1712, because the next appearance position information "39:49" exists, the approximate document search unit 103 returns to step S1703.

In step S1703, the approximate document search unit 103 starts processing the appearance position information "39:49" stored in the appearance position buffer.

In step S1704, because the match count "3" of the corrected appearance position "39:49" is equal to or greater than the prescribed value "3", the approximate document search unit 103 proceeds to step S2401.

In step S2401, the approximate document search unit 103 determines whether the document information buffer contains appearance position information having the query document appearance position "17", which is that of the corrected appearance position "39:49".

Because the appearance position information "39:29", which has the query document appearance position "17", is in the document information buffer, the process proceeds to step S2402.

In step S2402, the approximate document search unit 103 compares the match count "3" of the appearance position information "39:49" being processed with the match count "3" of the appearance position information "39:29" registered in the document information buffer. Because the match count of "39:49" is not greater, the process proceeds to step S1706.

Thereafter, the same processing as in the first embodiment yields a degree of approximation of 0.5 for the document with document ID "39".

Although document ID "39" does not contain many passages that match the query document, the approximate document search of the first embodiment gives it a degree of approximation of 1.0, the same value as a complete match with the query document. This happens because the passage that matches the query document appears multiple times in the registered document. In this embodiment, a passage that appears multiple times is evaluated only once, so document ID "39" receives a degree of approximation of 0.5, which is closer to human intuition.
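
The duplicate check of steps S2401 to S2403 can be sketched as follows. This is a minimal illustration, not the implementation of FIG. 24: the names `Appearance` and `register_once` are hypothetical, and the document information buffer is modeled as a dictionary keyed by query document appearance position, which is an assumption made to keep the lookup simple.

```python
from typing import NamedTuple


class Appearance(NamedTuple):
    doc_id: int         # ID of the registered document
    corrected_pos: int  # corrected appearance position within that document
    query_pos: int      # appearance position in the query document
    match_count: int    # number of matches recorded for this position


def register_once(doc_buffer, current):
    """Keep at most one entry per query document appearance position."""
    existing = doc_buffer.get(current.query_pos)       # step S2401
    if existing is None:
        doc_buffer[current.query_pos] = current        # step S1705
    elif current.match_count > existing.match_count:   # step S2402
        doc_buffer[current.query_pos] = current        # step S2403
    # Otherwise the registered entry is kept and the flow moves on to S1706.


# Entries "39:29" and "39:49" from the example above: both have query
# document appearance position 17 and a match count of 3.
doc_buffer = {}
register_once(doc_buffer, Appearance(39, 29, 17, 3))
register_once(doc_buffer, Appearance(39, 49, 17, 3))
print(doc_buffer)
```

Because the comparison is strict, equal match counts leave the earlier entry in place, which matches the handling of "39:49" in the example.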

As described above, according to the embodiments, the degree of approximation between a registered document and the query document is calculated according to the number of partial character strings in the registered document that have the same positional relationship as the partial character strings constituting the query document. This makes it possible to determine approximating registered documents with high accuracy.

In addition, components that are effective for discriminating registered documents are selected on the basis of statistics, taken over the entire set of registered documents, for the components that appear in the query document. This provides a mechanism for accurate and fast approximate document search.
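
One way to picture this component selection is the stopping rule recited in the claims: partial character strings are processed in descending order of information amount until the accumulated amount exceeds an entropy value. The sketch below is an illustration under several assumptions. It takes the information amount to be the standard self-information, -log2(p), with p estimated as a frequency divided by a total count. The names `select_components`, `frequencies`, `total`, and `entropy_budget` are hypothetical, and the entropy value is passed in rather than derived, because its exact computation is not spelled out in this text.

```python
import math


def select_components(frequencies, total, entropy_budget):
    """Pick strings in descending order of information amount until the
    accumulated amount exceeds the given entropy value."""
    # Assumed information amount: rarer strings carry more information.
    info = {s: -math.log2(f / total) for s, f in frequencies.items() if f > 0}
    selected = []
    accumulated = 0.0
    for s in sorted(info, key=info.get, reverse=True):
        if accumulated > entropy_budget:
            break  # the accumulated amount has exceeded the entropy value
        selected.append(s)
        accumulated += info[s]
    return selected


# Made-up frequencies out of a made-up total of 16.
freqs = {"a": 1, "b": 2, "c": 4, "d": 8}
print(select_components(freqs, total=16, entropy_budget=6.5))
```

Processing rare strings first means that fewer lookups are needed before the stopping condition is met, which is one plausible source of the speed benefit mentioned above.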

Embodiments of the present invention have been described in detail above. The present invention can be embodied as, for example, a system, an apparatus, a method, a program that can be read and executed by a relay processing apparatus, or a storage medium. Specifically, it may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

Needless to say, the object of the present invention is also achieved by supplying a system or an apparatus with a storage medium on which the program code of software realizing the functions of the above embodiments is recorded, and by having the computer (or CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium.

In this case, the program code read from the storage medium itself realizes the functions of the above embodiments, and the program code and the storage medium storing it constitute the present invention.

Examples of storage media that can be used to supply the program code include a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, and a ROM.

Needless to say, the functions of the above embodiments are realized not only when the computer executes the program code it has read. They are also realized when an OS (basic system or operating system) running on the computer performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above embodiments.

Furthermore, needless to say, the invention also covers the case in which the program code read from the storage medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that board or unit performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above embodiments.

DESCRIPTION OF SYMBOLS
100 Approximate document search apparatus
101 Document registration unit
102 Registered document information storage area
103 Approximate document search unit
110 Registered document
111 Query document
112 Approximate search result

Claims

An information processing apparatus that includes storage means storing partial character strings, obtained by decomposing the text of a registered document (a document to be compared for approximation against a designated query document), together with the positions of those partial character strings in the registered document, and that determines the registered document approximating the designated query document, the apparatus comprising:
partial character string acquisition means for acquiring partial character strings that are identical to the partial character strings obtained by decomposing the query document and that are included in the registered document stored in the storage means;
calculation means for calculating the degree of approximation between the registered document and the query document using the number of partial character strings for which the partial character string of the query document and the partial character string of the registered document have the same positional relationship, as obtained from the positions of the partial character strings in the query document and the positions of the partial character strings in the registered document acquired by the partial character string acquisition means;
information amount calculation means for calculating, for each partial character string of the query document, the information amount of that partial character string according to the appearance probability obtained from the appearance frequency of the partial character string in the registered documents stored in the storage means;
information entropy calculation means for calculating, for the partial character strings of the query document, the information entropy of the registered documents according to the appearance frequency of the first character of the partial character strings of the query document included in the registered documents stored in the storage means; and
determination means for determining the registered document approximating the query document according to the calculation result of the calculation means,
wherein the partial character string acquisition means, the calculation means, and the determination means are executed in descending order of the information amount calculated by the information amount calculation means, until the information amount accumulated each time the calculation means calculates the positional relationship between a partial character string of the query document and a partial character string of the registered document exceeds the calculated information entropy.

The information processing apparatus according to claim 1, wherein the calculation means performs the calculation by treating, as the same positional relationship, a positional relationship in which the deviation between the positional relationship of the partial character strings in the registered document and the positional relationship of the partial character strings in the query document is within an allowable range.

The information processing apparatus according to claim 1, wherein the calculation means uses the number of matching positional relationships as the number of the partial character strings.

The information processing apparatus according to claim 1, further comprising correction means for correcting the positions of the partial character strings of the registered document stored in the storage means according to the positions of the partial character strings of the query document,
wherein, when the corrected positions of the registered document obtained through correction by the correction means are the same position, the calculation means uses them as the number having the same positional relationship.

The information processing apparatus according to claim 4, wherein the calculation means uses, as the number of the partial character strings, the number of partial character strings having the same positional relationship at a corrected position where the number of partial character strings determined to have the same positional relationship is equal to or greater than a predetermined value.

The information processing apparatus according to any one of claims 1 to 5, wherein the determination means determines a registered document whose degree of approximation calculated by the calculation means is equal to or greater than a predetermined value to be a registered document approximating the query document.

The information processing apparatus according to any one of claims 1 to 6, wherein the calculation means uses, as the number of the partial character strings, the number obtained when the positional relationship between the partial character strings of the query document and the partial character strings of the registered document is the same within a range of predetermined positions.

The information processing apparatus according to any one of claims 4 to 7, wherein the calculation means performs the calculation by counting, as the number having the same positional relationship, cases in which the difference between the corrected positions of the registered document, obtained through correction by the correction means for the partial character strings of the query document, is within an allowable value.

The information processing apparatus according to claim 8, wherein the calculation means calculates the degree of approximation between the registered document and the inquiry document according to the number having the same positional relationship and the difference between the corrected positions.

The information processing apparatus according to any one of claims 4 to 9, wherein, when the corrected positions of the registered document obtained by the correction means differ for partial character strings at the same position in the inquiry document, the calculation means uses, as the number of the partial character strings, the larger of the numbers of partial character strings having the same positional relationship.

A method of controlling an information processing apparatus that comprises storage means for storing partial character strings obtained by decomposing sentences included in a registered document, the registered document being a document to be compared for approximation against a designated inquiry document, together with positions of the partial character strings in the registered document, and that determines the registered document approximating the designated inquiry document, the method comprising:
a partial character string acquisition step in which partial character string acquisition means of the information processing apparatus acquires partial character strings that are identical to the partial character strings obtained by decomposing the inquiry document and that are included in the registered document stored in the storage means;
a calculation step in which calculation means of the information processing apparatus calculates the degree of approximation between the registered document and the inquiry document using the number of partial character strings for which the partial character string for the inquiry document and the partial character string for the registered document have the same positional relationship, the positional relationship being obtained from the position of the partial character string in the inquiry document and the position of the partial character string in the registered document acquired in the partial character string acquisition step;
an information amount calculation step in which information amount calculation means of the information processing apparatus calculates, for each partial character string of the inquiry document, the information amount of the partial character string according to an appearance probability obtained from the appearance frequency of the partial character string in the registered document stored in the storage means;
an information entropy calculation step in which information entropy calculation means of the information processing apparatus calculates, for the partial character strings of the inquiry document, the information entropy of the registered document according to the appearance frequency, in the registered document stored in the storage means, of the first character of each partial character string of the inquiry document; and
a determination step in which determination means of the information processing apparatus determines a registered document that approximates the inquiry document according to a calculation result of the calculation step,
wherein the calculation step calculates the positional relationship between the partial character string for the inquiry document and the partial character string for the registered document in descending order of the information amount calculated in the information amount calculation step,
and the partial character string acquisition step, the calculation step, and the determination step are executed until the information amount accumulated each time the positional relationship is calculated exceeds the calculated information entropy.
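
The ordering and stopping condition of the method above can be sketched as below. The sketch assumes the information amount of a partial character string is its self-information, -log2(p), where p is its appearance probability, and it takes the information entropy as a precomputed input because the claim does not give its formula. The function names and the sample probabilities are hypothetical.

```python
import math

def information_amount(probability):
    """Self-information in bits; assumes 0 < probability <= 1."""
    return -math.log2(probability)

def select_until_entropy(gram_probabilities, entropy_threshold):
    """Process partial character strings in descending order of information
    amount, stopping once the accumulated amount exceeds the threshold."""
    ranked = sorted(
        gram_probabilities.items(),
        key=lambda item: information_amount(item[1]),
        reverse=True,
    )
    selected = []
    accumulated = 0.0
    for gram, probability in ranked:
        selected.append(gram)  # the positional comparison would happen here
        accumulated += information_amount(probability)
        if accumulated > entropy_threshold:
            break
    return selected, accumulated

# Hypothetical appearance probabilities for three partial character strings.
probabilities = {"abc": 0.5, "xyz": 0.125, "qqq": 0.25}
selected, accumulated = select_until_entropy(probabilities, entropy_threshold=4.0)
```

Rare partial character strings carry the most information, so processing them first lets the comparison stop early without examining the common ones.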

A program readable and executable by an information processing apparatus that comprises storage means for storing partial character strings obtained by decomposing sentences included in a registered document, the registered document being a document to be compared for approximation against a designated inquiry document, together with positions of the partial character strings in the registered document, and that determines the registered document approximating the designated inquiry document,
the program causing the information processing apparatus to function as:
partial character string acquisition means for acquiring partial character strings that are identical to the partial character strings obtained by decomposing the inquiry document and that are included in the registered document stored in the storage means;
calculation means for calculating the degree of approximation between the registered document and the inquiry document using the number of partial character strings for which the partial character string for the inquiry document and the partial character string for the registered document have the same positional relationship, the positional relationship being obtained from the position of the partial character string in the inquiry document and the position of the partial character string in the registered document acquired by the partial character string acquisition means;
information amount calculation means for calculating, for each partial character string of the inquiry document, the information amount of the partial character string according to an appearance probability obtained from the appearance frequency of the partial character string in the registered document stored in the storage means;
information entropy calculation means for calculating, for the partial character strings of the inquiry document, the information entropy of the registered document according to the appearance frequency, in the registered document stored in the storage means, of the first character of each partial character string of the inquiry document; and
determination means for determining a registered document that approximates the inquiry document according to a calculation result of the calculation means,
wherein the calculation means calculates the positional relationship between the partial character string for the inquiry document and the partial character string for the registered document in descending order of the information amount calculated by the information amount calculation means,
and the partial character string acquisition means, the calculation means, and the determination means operate until the information amount accumulated each time the positional relationship is calculated exceeds the calculated information entropy.