JP2008077252A

JP2008077252A - Document ranking method, document retrieval method, document ranking device, document retrieval device, and recording medium

Info

Publication number: JP2008077252A
Application number: JP2006253606A
Authority: JP
Inventors: Eiji Kenmochi; 栄治剣持; Atsuo Shimada; 敦夫嶋田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2006-09-19
Filing date: 2006-09-19
Publication date: 2008-04-03

Abstract

PROBLEM TO BE SOLVED: To rank documents by a criterion related to the amount and density of the topics they contain.

SOLUTION: In step S1, a document input process is carried out to input documents. In step S2, using a document database 11, a morphological analysis process is carried out to extract morphemes from the documents. In step S3, a simple sentence extraction process is carried out to extract simple sentences from the documents. In step S4, a simple sentence similarity calculation process is carried out to calculate the similarity between the simple sentences extracted by the simple sentence extraction process. In step S5, using the document database 11, a similar simple sentence concatenation set group extraction process is carried out to extract similar simple sentence concatenation set groups based on the similarity relationships between the simple sentences and the positions at which the simple sentences appear in the documents. In step S6, an inter-document score calculation process is carried out to calculate a score for each document based on the extracted similar simple sentence concatenation set groups, and the processing then ends.

COPYRIGHT: (C) 2008, JPO & INPIT

Description

The present invention relates to a document ranking method, a document retrieval method, a document ranking device, a document retrieval device, and a recording medium, and is suitable for document processing systems based on natural language processing, such as document retrieval and document classification.

In recent years, with the development of Internet technologies such as the World Wide Web (hereinafter "WWW"), it has become easy to access large amounts of document data, and various document retrieval techniques have been proposed and used for finding only the document data of interest within such large collections.
For example, as can be seen in the results returned by Google, a representative search system on the WWW, search results are ranked according to their similarity to the query terms and presented in that order for the user's convenience.

However, as pointed out in Patent Document 1, a very large number of search results is retrieved when the query contains few terms or consists of common words, and the number of documents matching a query is itself increasing because of the rapid growth of the WWW. For such reasons search results also become very large, and situations in which it is difficult to make effective use of them are becoming more common.
Methods for solving this problem include those of Patent Documents 1 and 2, which propose providing appropriate search results by re-ranking them through analysis of the link structure among the retrieved documents. In other words, those inventions show that search results become more convenient to use when a document set that satisfies a certain criterion, collected by a search system or the like (in a similarity search system, the matching similarity between the query terms and the documents searched), is re-ranked according to a different criterion.
Patent documents: US Pat. No. 6,725,259; US Pat. No. 6,738,678; Japanese Patent No. 3612597; JP 2004-234284 A

For example, FIG. 1 shows an example of results retrieved from Google with the query “郵政民営化ａｎｄ取りまとめ” (roughly, "postal privatization" and "compilation"), and it can be seen that the retrieved documents differ in length. The "and" in the query denotes an AND search. From these results it can further be inferred that the size of the region related to the "topic" corresponding to the query, and hence its level of detail, also differs from document to document.
Therefore, by extracting from each retrieved document the region corresponding to the topic of the query and comparing the sizes of these regions as a measure of how detailed the topic is, the search results can be ranked by level of topic detail, which is expected to make them more convenient to use.

In addition, the retrieved documents should contain not only the topic corresponding to the query but also other topics, and topic regions, that differ from it yet are shared among the documents. Using this information as well should make it possible to provide richer rankings that take the content of each document into account.
Moreover, once a particular topic region has been identified in different documents, if the regions differ in size, then, seen from the topic region of one document, the topic region of another document can be regarded as a summary or as a detailed description. The approach is therefore applicable not only to re-ranking search results but also to analyzing document sets and extracting information. Here, a simple sentence chain denotes a topic region, a simple sentence concatenation set denotes the set of topic regions for a single topic, and a group of simple sentence concatenation sets denotes a family of topic region sets.

An object of the present invention is to provide a document ranking method and a document ranking device that can rank documents by a criterion related to the amount and density of their topics, by extracting simple sentences from the documents, obtaining the similarity between simple sentences, extracting similar simple sentence concatenation set groups from those similarities and from the positions at which the simple sentences appear in the documents, calculating a score indicating the degree of relatedness between documents based on the extracted groups, and ranking the documents based on the calculated score. A further object is to provide a document retrieval method and a document retrieval device in which, by applying this ranking method to the results of a similarity search, the search results are ranked by a criterion related to the amount and density of the topics the documents contain.

To achieve the above object, the invention according to claim 1 is a document ranking method comprising: a document input step of inputting documents; a simple sentence extraction step of extracting simple sentences from the documents; an inter-simple-sentence similarity calculation step of calculating the similarity between the simple sentences extracted in the simple sentence extraction step; a similar simple sentence concatenation set group extraction step of extracting similar simple sentence concatenation set groups from the similarity relationships of the simple sentences and the positions at which the simple sentences appear in the documents; and a document score calculation step of calculating a score for each document based on the similar simple sentence concatenation set groups extracted in the extraction step.
The invention according to claim 2 is the document ranking method according to claim 1, further comprising a morphological analysis step of extracting morphemes from the documents, wherein the simple sentence extraction step extracts simple sentences based on the morpheme information obtained in the morphological analysis step, and the inter-simple-sentence similarity calculation step calculates the similarity between simple sentences based on the analyzed morpheme information.

The invention according to claim 3 is the document ranking method according to claim 2, wherein the document score calculation step calculates the score from the importance of the similar simple sentence concatenation sets contained in the document and the proportion of the document that those sets account for.
The invention according to claim 4 is a document retrieval method comprising the document ranking method according to claim 3, a document retrieval step of retrieving similar documents, and a search result presentation step of presenting search results, wherein the search results presented in the search result presentation step are ranked by the document ranking method.
The invention according to claim 5 is a document ranking device comprising: document input means for inputting documents; simple sentence extraction means for extracting simple sentences from the documents; inter-simple-sentence similarity calculation means for calculating the similarity between the simple sentences extracted by the simple sentence extraction means; similar simple sentence concatenation set group extraction means for extracting similar simple sentence concatenation set groups from the similarity relationships of the simple sentences and the positions at which the simple sentences appear in the documents; and document score calculation means for calculating a score for each document based on the similar simple sentence concatenation set groups extracted by the extraction means.

The invention according to claim 6 is the document ranking device according to claim 5, further comprising morphological analysis means for extracting morphemes from the documents, wherein the simple sentence extraction means extracts simple sentences based on the morpheme information obtained by the morphological analysis means, and the inter-simple-sentence similarity calculation means calculates the similarity between simple sentences based on the analyzed morpheme information.
The invention according to claim 7 is the document ranking device according to claim 6, wherein the document score calculation means calculates the score from the importance of the similar simple sentence concatenation sets contained in the document and the proportion of the document that those sets account for.
The invention according to claim 8 is the document ranking device according to claim 7, further comprising a document information database, wherein the identifiers of the documents input by the document input means, the morphological analysis results extracted by the morphological analysis means, the simple sentence information extracted by the simple sentence extraction means, the inter-simple-sentence similarities calculated by the inter-simple-sentence similarity calculation means, the similar simple sentence concatenation set groups extracted by the similar simple sentence concatenation set group extraction means, and the document scores calculated by the document score calculation means are each stored in the document database in an appropriate format.

The invention according to claim 9 comprises the document ranking device according to claim 8, document retrieval means for retrieving similar documents, and search result presentation means for presenting the search results, wherein the search results presented by the search result presentation means are ranked by the document ranking device.
The invention according to claim 10 is a recording medium on which a program for executing the document ranking method according to any one of claims 1 to 3 is recorded in a computer-readable format.
The invention according to claim 11 is a recording medium on which a program for executing the document retrieval method according to claim 4 is recorded in a computer-readable format.

According to the present invention, a ranking method can be provided that ranks documents by a criterion related to the amount and density of their topics.
Furthermore, by applying this ranking method to the results of a similar-document search, a similar-document search method can be provided in which the search results are ranked by a criterion related to the amount and density of the topics the documents contain.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 2 shows a configuration example of an information extraction apparatus according to an embodiment of the present invention.
The information extraction apparatus shown in FIG. 2 is implemented on a computer and has a keyboard 2 as input means for registering documents; a communication I/O interface 3 as communication means for receiving signals from outside and transmitting signals from the apparatus; a CPU 4 that centrally executes the processing of the apparatus; a memory 5 (which may be volatile RAM or non-volatile ROM); a hard disk 6 as storage means; and a display 7 and a printer 8 as output means. The communication I/O interface 3 may be, for example, a modem or a terminal adapter, and can receive data from a server or the like connected to an intranet or the Internet 10 via a communication line. The CPU 4 executes a program according to the procedure recorded in the memory 5.

FIG. 3 is a flowchart showing the document ranking process executed by the information extraction apparatus of this embodiment. The process shown in FIG. 3 is realized by the CPU 4 shown in FIG. 2 executing a program according to the procedure recorded in the memory 5.
In this case, the CPU 4 first executes, in step S1, a document input process for inputting documents. Next, in step S2, it executes a morphological analysis process that uses the document database 11 to extract morphemes from the documents, and in step S3 it executes a simple sentence extraction process for extracting simple sentences from the documents. In step S4, it executes an inter-simple-sentence similarity calculation process for calculating the similarity between the simple sentences extracted by the simple sentence extraction process. In the following step S5, it executes a similar simple sentence concatenation set group extraction process that uses the document database 11 to extract similar simple sentence concatenation set groups from the similarity relationships of the simple sentences and the positions at which the simple sentences appear in the documents. Finally, in step S6, it executes an inter-document score calculation process for calculating the score of each document based on the extracted similar simple sentence concatenation set groups, and the processing ends.

FIG. 4 is a flowchart showing the document retrieval process executed by the information extraction apparatus of this embodiment. The process shown in FIG. 4 is realized by the CPU 4 shown in FIG. 2 executing a program according to the procedure recorded in the memory 5.
In this case, the CPU 4 executes, in step S11, a document retrieval process for retrieving similar documents; in the following step S12, it executes the document ranking process using the document ranking method shown in FIG. 3; and in step S13, it executes a search result presentation process. The search results presented by the search result presentation process are ranked by applying the document ranking method of FIG. 3 to the retrieved documents.

Each of the processes described above is explained in detail below.
<Document input process>
In this embodiment, a document is defined as a character string that a user or an application designates as a single unit. Here, each of document 1 to document 4 in FIG. 1 is treated as a document, and the data of these documents is assumed to be input in an appropriate format.
A unique identification number is assigned to each input document. That is, the numbers 1 to 4 are assigned as identification numbers to the document data of document 1 to document 4 and are recorded in the document information database.
The document information database may be built on an auxiliary storage device such as the hard disk 6 shown in FIG. 2, or on main storage such as the memory 5.

<Morphological analysis process>
In the morphological analysis process (document analysis process) of step S2, morphological analysis is applied to each input document to extract a morpheme sequence from it. Because this embodiment uses Japanese documents as examples, the operation of Japanese morphological analysis is shown, but the documents may be in English or another language, in which case a morphological analysis system for that language may be used.
FIG. 5 shows the result of applying morphological analysis to the text “小泉純一郎首相は２５日午前、武部勤自民党幹事長と首相官邸で会い、” ("Prime Minister Junichiro Koizumi met LDP Secretary-General Tsutomu Takebe at the Prime Minister's official residence on the morning of the 25th,"), which is part of the document with document identifier 1 (document 1). The present invention does not require any particular morphological analysis system; any system that provides morphemes and part-of-speech information may be used. For example, ChaSen (http://chasen.naist.jp/hiki/ChaSen/) is a well-known Japanese morphological analysis system.

In FIG. 5, each row corresponds to one morpheme and gives its surface form, part-of-speech information, and identifier.
In this embodiment, morpheme parts of speech are broadly divided into those generally regarded as independent-word parts of speech and those regarded as attached-word (function-word) parts of speech, and attached-word parts of speech are assigned the symbol “付”. Among independent-word parts of speech, nominal parts of speech such as nouns and unregistered words are assigned the symbol “自体”, and predicate parts of speech such as verbs and adjectival verbs are assigned the symbol “自用”. Symbols are treated as attached words, with two exceptions: the sentence-final period (句点) is assigned the symbol “句点” and the comma (読点) is assigned the symbol “読点”.
A different morpheme identifier is assigned whenever the surface form and part-of-speech information differ, but for simplicity the identifiers provided by the morphological analysis system may be used instead.
Although not shown explicitly, morphological analysis is also applied to the remaining text of document 1 and to the other documents, and the analysis results are recorded in the document information database in an appropriate format.
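The coarse part-of-speech assignment described above can be sketched as a small mapping function. This is only an illustration: the fine-grained tag names (`"noun"`, `"unknown"`, `"verb"`, `"adjectival-verb"`, `"symbol"`) are hypothetical stand-ins, since a real analyzer such as ChaSen uses its own tag set, and the function name `coarse_pos` does not come from the source.

```python
# Hypothetical fine-grained tags; a real morphological analyzer has its own tag set.
NOMINAL_TAGS = {"noun", "unknown"}            # nominal independent words -> 自体
PREDICATE_TAGS = {"verb", "adjectival-verb"}  # predicate independent words -> 自用


def coarse_pos(surface: str, tag: str) -> str:
    """Map a morpheme to one of the five symbols 付, 自体, 自用, 句点, 読点."""
    if tag == "symbol":
        # Symbols count as attached words, except the period and the comma.
        if surface == "。":
            return "句点"
        if surface == "、":
            return "読点"
        return "付"
    if tag in NOMINAL_TAGS:
        return "自体"
    if tag in PREDICATE_TAGS:
        return "自用"
    # Everything else (particles, auxiliaries, ...) is treated as an attached word.
    return "付"
```

Keeping only five coarse symbols is what lets the later sentence-boundary rule be written as a short regular expression over part-of-speech symbols.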

As examples of the data generated by the morphological analysis process, FIG. 6 shows a document information table, FIG. 7 shows the morpheme list table for document identifier 1, and FIG. 8 shows a morpheme information table.
The document information table shown in FIG. 6 consists of the document identifier assigned to each document, the morpheme list identifier of the morpheme list table corresponding to the document, and the simple sentence list identifier of the simple sentence list table corresponding to the document. In this embodiment, numerical values are used for the document identifier, morpheme list identifier, and simple sentence list identifier, and the same value is assigned to all three, but the identifiers may be different unique numbers or unique character strings.
The morpheme list table shown in FIG. 7 consists of the identifiers of all morphemes that make up an individual document, listed in order of appearance. The document can be reproduced by looking up the morpheme identifiers of the morpheme list in the morpheme information table shown in FIG. 8 and expanding them in ascending order of list position.
Therefore, as long as this data is kept, the original document data need not be stored in the document information database. In addition, morpheme frequency information for a document can easily be obtained by counting how often each morpheme identifier occurs in its morpheme list.
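A minimal sketch of the morpheme list table and the morpheme information table, under the assumption that both are held as in-memory Python dictionaries. The class name `MorphemeTables`, its method names, and the sample morphemes are illustrative and do not come from the source, which only requires "an appropriate format".

```python
from collections import Counter


class MorphemeTables:
    """In-memory stand-in for the morpheme list table and morpheme information table."""

    def __init__(self):
        self.morpheme_info = {}   # morpheme id -> (surface, pos), one entry per pair
        self._id_of = {}          # (surface, pos) -> morpheme id
        self.morpheme_lists = {}  # document id -> morpheme ids in order of appearance

    def add_document(self, doc_id, morphemes):
        ids = []
        for surface, pos in morphemes:
            key = (surface, pos)
            if key not in self._id_of:
                new_id = len(self._id_of) + 1
                self._id_of[key] = new_id
                self.morpheme_info[new_id] = key
            ids.append(self._id_of[key])
        self.morpheme_lists[doc_id] = ids

    def reconstruct(self, doc_id):
        # Expanding the list in order of appearance reproduces the document text.
        return "".join(self.morpheme_info[m][0] for m in self.morpheme_lists[doc_id])

    def frequencies(self, doc_id):
        # Morpheme frequency information falls out of the list by simple counting.
        return Counter(self.morpheme_lists[doc_id])


tables = MorphemeTables()
tables.add_document(1, [("首相", "自体"), ("は", "付"), ("首相", "自体"), ("、", "読点")])
print(tables.reconstruct(1))   # 首相は首相、
print(tables.frequencies(1))   # Counter({1: 2, 2: 1, 3: 1})
```

Because every document's list points into one shared information table, each distinct (surface form, part of speech) pair is stored only once.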

The morpheme information table shown in FIG. 8 consists of the identifier, surface form, and part-of-speech information of each morpheme extracted from the documents, and contains at most one entry for any given combination of surface form and part-of-speech information. The morpheme information table is shared by all morpheme list tables.
In this embodiment, the analysis results of all documents are assumed to be recorded in the document information database in the table formats shown in FIGS. 6 to 8.

<Simple sentence extraction process>
In the simple sentence extraction process of step S3, simple sentences are extracted from the input documents. This embodiment shows an example in which the extent of each simple sentence is determined by a rule over the morpheme sequence of the document, but the extent may instead be obtained without morphological analysis, using a rule applied directly to the document's character string. To determine the extent of a simple sentence, the regular expression rule over morpheme parts of speech given in (Formula 1) is used.
(Formula 1)

In Formula 1, "[]" denotes a class, "^" denotes negation of a class, "+" denotes one or more repetitions, and "|" denotes alternation. (For regular expressions, see, for example, “詳説 正規表現 第２版” [ISBN 4-87311-130-7].)

Formula 1 is a regular expression that matches a sequence of morphemes whose part of speech is neither 句点 (period) nor 読点 (comma), followed either by a morpheme whose part of speech is 自用 and then a 句点 morpheme, or by a morpheme whose part of speech is 読点.
FIG. 9 shows the simple sentences obtained by applying this rule to the morphological analysis result of document 1. Two simple sentences are extracted from document 1, and each is given a unique identifier. For each simple sentence, the upper row shows the surface forms of the morphemes and the lower row shows their parts of speech; the morphemes are separated by spaces for readability.
Although not shown explicitly, simple sentences are also extracted from documents 2 to 4, and the results are shown in FIG. 10. Two simple sentences, with identifiers 1 and 2, are extracted from document 1; four simple sentences, with identifiers 3 to 6, from document 2; eight simple sentences, with identifiers 7 to 14, from document 3; and ten simple sentences, with identifiers 15 to 24, from document 4. The extracted simple sentence information is stored in the document information database in an appropriate format.
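The formula itself is not reproduced in this text, so the following is only one literal reading of its prose description, written with Python's `re` module. Each coarse part-of-speech symbol is first encoded as a single letter so that the rule can run over a plain string; the letter codes, the pattern `[^kt]+(?:vk|t)`, and the function name are assumptions made for illustration.

```python
import re

# One-letter codes for the five coarse part-of-speech symbols (illustrative choice).
POS_CODE = {"付": "f", "自体": "n", "自用": "v", "句点": "k", "読点": "t"}

# One or more morphemes that are neither 句点 nor 読点,
# followed by either (自用 then 句点) or a single 読点.
SIMPLE_SENTENCE = re.compile(r"[^kt]+(?:vk|t)")


def extract_simple_sentences(pos_tags):
    """Return (start, end) morpheme indices, 0-based and inclusive, of each match."""
    encoded = "".join(POS_CODE[p] for p in pos_tags)
    return [(m.start(), m.end() - 1) for m in SIMPLE_SENTENCE.finditer(encoded)]


# Hypothetical tag sequence: a clause ending in a comma, then one ending in verb + period.
tags = ["自体", "付", "自体", "読点", "自体", "付", "自用", "句点"]
print(extract_simple_sentences(tags))  # [(0, 3), (4, 7)]
```

Under this reading, a sentence that ends in a noun followed directly by a period produces no match; whether the original formula behaves the same way cannot be confirmed from the text.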

As an example of the data generated by the simple sentence extraction process, FIG. 11 shows the simple sentence end position list table for document identifier 1.
Because simple sentences appear consecutively within a document, each simple sentence can be uniquely identified by the position of its final morpheme in the document. The simple sentence end position list therefore lists, in order of appearance of the simple sentences, the index number (position of appearance) in the morpheme list table of the final morpheme of each simple sentence.
That is, FIG. 8 lists 8, which is the index number in the list table of FIG. 7 of the period (句点) of document 1, and 38, which is the index number of the comma (読点) of document 1, although the latter is not shown explicitly in FIG. 7. In this embodiment, the simple sentence extraction results of all documents are assumed to be recorded in the document information database in the table format shown in FIG. 11.
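Because only end positions are stored, the extent of each simple sentence has to be recovered from its neighbours. A minimal sketch, assuming 1-based index numbers and assuming that the first simple sentence starts at index 1 and that the sentences are contiguous, as the text states; the function names and the placeholder morpheme identifiers are illustrative.

```python
def spans_from_end_positions(end_positions):
    """Turn a list of 1-based end indices into (start, end) spans, both inclusive."""
    spans = []
    start = 1
    for end in end_positions:
        spans.append((start, end))
        start = end + 1  # the next simple sentence begins right after this one
    return spans


def sentence_morphemes(morpheme_list, span):
    """Slice the morpheme identifiers of one simple sentence out of a morpheme list."""
    start, end = span
    return morpheme_list[start - 1:end]


print(spans_from_end_positions([8, 38]))  # [(1, 8), (9, 38)]

# Placeholder morpheme identifiers 101..138 standing in for a 38-morpheme document.
doc_morphemes = list(range(101, 139))
print(sentence_morphemes(doc_morphemes, (1, 8)))  # [101, 102, ..., 108]
```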

<Simple sentence similarity calculation processing>
In the simple sentence similarity calculation process of step S4, the similarity of every pair of extracted simple sentences is calculated. The present embodiment shows an operation example in which the similarity between simple sentences is calculated as the cosine between vectors shown in Formula 2, using the frequency information (frequency vectors) of the morphemes in each simple sentence whose part of speech is an independent word. The frequency vector of a simple sentence can easily be obtained from the morpheme list table of FIG. 7 and the simple sentence end position list table of FIG. 11, so the details are omitted here.
(Formula 2)
\[ \cos(s_1, s_2) = \frac{s_1 \cdot s_2}{\|s_1\| \, \|s_2\|} \]

In Formula 2, s1 and s2 are the morpheme appearance frequency vectors of simple sentence 1 and simple sentence 2, respectively. Their dimension equals the number of distinct morphemes with an independent-word part of speech across all the documents. The symbol · denotes the inner product of vectors and || denotes the norm of a vector.
The simple sentence similarity adopted in the present embodiment is a very common measure of inter-document similarity. Since inter-document similarity has been studied extensively, other such measures can be introduced by regarding each simple sentence as a document.
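The cosine between two morpheme frequency vectors can be sketched as below. This is an illustrative sketch only: the function name `cosine_similarity` and the representation of a simple sentence as a list of independent-word morpheme strings are assumptions, and the frequency vectors are kept sparse with `collections.Counter` instead of being expanded to the full vocabulary dimension.

```python
import math
from collections import Counter


def cosine_similarity(morphemes_a, morphemes_b):
    """Cosine between the morpheme frequency vectors of two simple sentences.

    Each argument is a list of independent-word morphemes (strings).
    """
    freq_a, freq_b = Counter(morphemes_a), Counter(morphemes_b)
    # Inner product: only morphemes present in both sentences contribute.
    dot = sum(freq_a[m] * freq_b[m] for m in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Morphemes absent from both sentences contribute nothing to the inner product or to the norms, so the sparse form gives the same value as the full-dimension vectors.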

FIG. 12 shows the result of applying the morpheme frequency vectors (not shown) of the simple sentences of FIG. 10 to Formula 2. The rows and columns of the matrix in FIG. 12 are simple sentence identifiers, each element is the cosine similarity between the simple sentences identified by its row and column, and elements with a similarity of 0.2 or more have a gray background. Since the similarity matrix is symmetric in the present embodiment, the lower triangular part and the similarity of each simple sentence with itself are not shown. For example, the similarity between simple sentence 1 and simple sentence 3 is 0.63.
The simple sentence similarities of FIG. 12 are also stored in the document database in an appropriate format. Because the amount of similarity data becomes enormous as the number of documents grows, the data volume can be reduced by thresholding, that is, setting all element values below the threshold to 0, and adopting a sparse data structure.

As an example of sparse-format data, FIG. 13 shows the valid row index list table and the column index-value list table for part of the simple sentence similarities of FIG. 12.
Each value in the valid row index list table is the index number of a row of the matrix in FIG. 12 that has at least one element equal to or greater than the threshold. Each entry in the column index-value list table is a value equal to or greater than the threshold in a row entered in the valid row index list table, together with its column index number.
The correspondence between the row indexes of the valid row index list table and the column index-value list tables is not shown here. However, if the column index-value list tables are themselves kept in a list, they can easily be put in one-to-one correspondence with the valid row index list table. Alternatively, each element of the valid row index list table may hold a reference to its column index-value list table.
In the present embodiment, the calculated similarities between all simple sentences are recorded in the document information database in the table format shown in FIG. 13.
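The two-table sparse format can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `to_sparse`, the 0-based indexes, and the toy matrix are hypothetical (only the value 0.63 and the threshold 0.2 appear in the text), and the lists are kept in parallel to realize the one-to-one correspondence mentioned above.

```python
def to_sparse(matrix, threshold):
    """Convert a symmetric similarity matrix into two parallel lists:
    valid_rows[k] is a row index with at least one element >= threshold,
    col_values[k] is that row's list of (column index, value) pairs.

    Only the strict upper triangle is scanned, since the matrix is symmetric
    and self-similarity is not stored.
    """
    valid_rows, col_values = [], []
    size = len(matrix)
    for i in range(size):
        entries = [(j, matrix[i][j])
                   for j in range(i + 1, size)
                   if matrix[i][j] >= threshold]
        if entries:  # rows with no surviving element are dropped entirely
            valid_rows.append(i)
            col_values.append(entries)
    return valid_rows, col_values


# Toy 3x3 matrix, not the data of FIG. 12.
toy = [[1.0, 0.63, 0.1],
       [0.63, 1.0, 0.0],
       [0.1, 0.0, 1.0]]
rows, cols = to_sparse(toy, threshold=0.2)
```

Keeping the two lists index-aligned is one of the two options the text describes; the other would store a reference to each column index-value list inside the row entry.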

<Similar simple sentence concatenation set group extraction processing>
In the similar simple sentence concatenation set group extraction process of step S5, all sets of simple sentences that are adjacent within a document and whose similarity is equal to or greater than a certain value are extracted, based on the similarities between the extracted simple sentences and their appearance positions in the documents. The present embodiment continues the operation example up to the simple sentence similarity calculation process and shows how the similar simple sentence concatenation set group is extracted from the simple sentence similarity matrix shown in FIG. 12.
FIG. 14 shows the simple sentence similarity matrix of FIG. 12 with elements whose similarity is 0.2 or more set to 1 and elements below 0.2 set to 0. In FIG. 14, elements with value 1 have a gray background, elements with value 0 are white, and lines are drawn at the document boundaries.
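The binarization that produces a matrix like the one in FIG. 14 can be sketched in a few lines. The function name `binarize` is hypothetical, and treating a similarity of exactly 0.2 as 1 is an assumption that follows the "0.2 or more" wording.

```python
def binarize(matrix, threshold=0.2):
    """Replace each similarity by 1 if it is >= threshold, otherwise 0."""
    return [[1 if value >= threshold else 0 for value in row]
            for row in matrix]


# Toy values, not the data of FIG. 12.
binary = binarize([[1.0, 0.63, 0.1],
                   [0.63, 1.0, 0.2]])
```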

In the present embodiment, each set of adjacent simple sentences in the same document is regarded as a node and a similarity relation between such sets as an edge. The connected components of the resulting graph are extracted to produce a simple sentence pair similarity graph. Nodes are then fused and connected components joined, based on the adjacency within the document of the node pairs of each connected component, to generate the similar simple sentence concatenation set group. FIG. 15 shows the processing flow.
First, in step S21, the CPU 4 generates sets whose elements are two adjacent simple sentences in the same document. In step S22, each generated set is regarded as a node, and when every simple sentence constituting two nodes in different documents has at least one similarity relation, an edge is drawn between the nodes to generate a graph. In step S23, it is determined whether a graph has been generated. If the result is affirmative ("Yes" in S23), the process proceeds to step S24, where the connected components of the generated graph are extracted. In step S25, for any two node pairs in different connected components, if the intersections of the corresponding nodes are all non-empty, the union of the corresponding nodes is taken as a new node and the graphs are joined. Finally, in step S26, each resulting connected component is extracted as a similar simple sentence concatenation set, and the process ends. If the result in step S23 is negative ("No" in S23), the process ends at that point.
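Steps S21 to S24 can be sketched as below. This is one possible reading, not the patent's implementation: all function names are hypothetical, the edge condition is interpreted as "every simple sentence of each node is similar to at least one simple sentence of the other node", and the toy similarity relations are invented for illustration rather than taken from FIG. 14.

```python
from itertools import combinations


def build_nodes(doc_sentences):
    """Step S21: one node per pair of adjacent simple sentences in a document.
    doc_sentences maps a document id to its ordered simple sentence ids."""
    return [(doc_id, frozenset(pair))
            for doc_id, ids in doc_sentences.items()
            for pair in zip(ids, ids[1:])]


def nodes_similar(node_a, node_b, similar):
    """Assumed edge condition: every sentence of each node is similar to at
    least one sentence of the other node. `similar` is a set of
    frozenset({i, j}) pairs whose similarity is at or above the threshold."""
    def covered(src, dst):
        return all(any(frozenset((s, d)) in similar for d in dst) for s in src)
    return covered(node_a, node_b) and covered(node_b, node_a)


def connected_components(nodes, similar):
    """Steps S22 to S24: link nodes of different documents, then collect
    the connected components (isolated nodes are ignored)."""
    adjacency = {node: set() for node in nodes}
    for n1, n2 in combinations(nodes, 2):
        if n1[0] != n2[0] and nodes_similar(n1[1], n2[1], similar):
            adjacency[n1].add(n2)
            adjacency[n2].add(n1)
    components, seen = [], set()
    for start in nodes:
        if start in seen or not adjacency[start]:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


docs = {1: [1, 2], 2: [3, 4, 5]}
similar_pairs = {frozenset((1, 3)), frozenset((2, 4))}
components = connected_components(build_nodes(docs), similar_pairs)
```

In this toy input, node {1, 2} is linked to node {3, 4} but not to node {4, 5}, so a single component with two nodes is returned.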

Specifically, in document 1, for example, {1, 2} is one node, and in document 2, {3, 4}, {4, 5}, and so on are nodes.
Next, edges are drawn between the nodes based on FIG. 14. The condition for drawing an edge is that each of the simple sentences constituting the nodes has a similarity equal to or greater than the threshold with at least one simple sentence (the threshold is 0.2 in the present embodiment and corresponds to element value 1 in FIG. 14). For example, edges can be drawn between node {1, 2} and node {3, 4}, and between node {1, 2} and node {7, 8}.
As a result, three connected components (graphs) can be extracted from FIG. 14, as shown in FIG. 16.
Next, nodes are fused and connected components are joined based on the adjacency within the document of the node pairs of the extracted connected components.

In the present embodiment, the condition for fusing nodes and joining connected components is as follows: "for any two node pairs in different connected components, if the intersections of the corresponding nodes are all non-empty, the union of the corresponding nodes is taken as a new node and the graphs are joined."
In the graphs of FIG. 16, the {7, 8}-{15, 16} component of graph 1 and the {8, 9}-{16, 17} component of graph 2 are node pairs that satisfy this condition. Nodes {7, 8} and {8, 9}, and nodes {15, 16} and {16, 17}, are therefore fused into the new nodes {7, 8, 9} and {15, 16, 17}, and graph 1 and graph 2 are joined.
FIG. 17 shows the result of joining graph 1 and graph 2 of FIG. 14 (graph 1'). Each graph in FIG. 17 is a similar simple sentence concatenation set.
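The fusion condition can be checked on the example node pairs with a short sketch. The function name `fuse_edges` and the representation of a node pair as a tuple of two `frozenset` objects are assumptions. Only the sentence identifiers come from the text.

```python
def fuse_edges(edge_a, edge_b):
    """Fuse two node pairs taken from different connected components.

    Each edge is a tuple of two nodes, each node a frozenset of simple
    sentence ids. If both intersections of corresponding nodes are
    non-empty, return the pair of unions; otherwise return None.
    """
    a1, a2 = edge_a
    # Try both orientations of edge_b, since an edge has no direction.
    for b1, b2 in (edge_b, edge_b[::-1]):
        if a1 & b1 and a2 & b2:
            return (a1 | b1, a2 | b2)
    return None


edge_graph1 = (frozenset({7, 8}), frozenset({15, 16}))
edge_graph2 = (frozenset({8, 9}), frozenset({16, 17}))
fused = fuse_edges(edge_graph1, edge_graph2)
```

Here {7, 8} and {8, 9} share sentence 8, and {15, 16} and {16, 17} share sentence 16, so the result is the node pair {7, 8, 9} and {15, 16, 17}.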

In the present embodiment, the initial node consists of two consecutive simple sentences in a document. However, the initial node may instead be a set of simple sentences generated by a window function centered on one node, and more elaborate mechanisms may be used for joining the extracted connected components, such as adding a threshold on the size of the union of nodes to the node fusion condition.
The extracted similar simple sentence concatenation set group is recorded in the document information database in an appropriate format.
As an example of similar simple sentence concatenation set group data, FIG. 18 shows a similar simple sentence concatenation set group list table containing the information of graph 1' and graph 2.

The node identifier is an identifier assigned to each node of a graph, the simple sentence identifier list is the set of identifiers of the simple sentences constituting the node, and the related node identifier list is the list of other nodes related to the node (connected to it by an edge). For example, the first row describes node {1, 2}: its identifier is 1, the simple sentences constituting it have identifiers 1 and 2, and the nodes related to it are the nodes with identifiers 2, 3, and 4. The simple sentence identifier list and the related node identifier list may instead be held in separate list tables, as in FIG. 11, with each element of this table holding a reference to those tables.
In the present embodiment, the calculated similar simple sentence concatenation set group is recorded in the document information database in the table format shown in FIG. 18.
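One row of such a list table could be modeled as a small record type. This is only a sketch: the class name `NodeRecord`, its field names, and the helper `related_nodes` are hypothetical, and only the first row (node identifier 1, node {1, 2}, related nodes 2, 3, and 4) is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeRecord:
    node_id: int                      # identifier of the node in its graph
    sentence_ids: List[int]           # simple sentences that make up the node
    related_node_ids: List[int] = field(default_factory=list)  # edge targets


def related_nodes(table: Dict[int, NodeRecord], node_id: int) -> List[NodeRecord]:
    """Return the records of the nodes connected to node_id, skipping
    identifiers that are not present in the table."""
    return [table[n] for n in table[node_id].related_node_ids if n in table]


# First row as described in the text; the second row is invented for the demo.
table = {
    1: NodeRecord(1, [1, 2], [2, 3, 4]),
    2: NodeRecord(2, [3, 4], [1]),
}
neighbours = related_nodes(table, 1)
```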

<Document score calculation processing>
In the document score calculation process of step S6 in FIG. 2, document scores are calculated from the information of the extracted similar simple sentence concatenation sets. The present embodiment continues the operation example up to the similar simple sentence concatenation set group and, given the group shown in FIG. 17, describes how a document score is calculated from the importance of the similar simple sentence concatenation sets contained in a document and the proportion of simple sentences contained in those sets.
First, as a scoring criterion, consider giving a high score to documents that contain many common topics and treat them in detail. If each graph in FIG. 17 is regarded as one common topic, a document should score well when it contains as many nodes with a large number of elements as possible. The score may be formulated as in Formula 3, for example.
(Formula 3)

When the scores of documents 1 to 4 are calculated from Formula 3, they are 0.66, 0.66, 2, and 2, respectively, so the documents are ranked in the order document 3, document 4, document 1, document 2.

As another scoring criterion, consider giving a good score to documents that contain many common topics but as little as possible other than those common topics. A document should then score well when it contains many nodes belonging to different graphs and has no simple sentences that belong to no node. The score may be formulated as in Formula 4, for example.
(Formula 4)

When the scores of documents 1 to 4 are calculated from Formula 4, they are 1, 0.5, 0.66, and 0.2, respectively, so the documents are ranked in the order document 1, document 3, document 2, document 4.
Although the present embodiment illustrates only operations based on these two criteria, scores may also be calculated under more complex criteria based on the information of the similar simple sentence concatenation set group.

The calculated document scores are recorded in the document information database.
As an example of document score data, FIG. 19 shows a document score list table containing the document scores calculated under the two scoring criteria.
The document score list table consists of a document identifier, score criterion 1, and score criterion 2. Score criterion 1 is the score of the document under the criterion that "documents that contain many common topics and treat them in detail receive a high score". Score criterion 2 is the score of the document under the criterion that "documents that contain many common topics but as little as possible other than those common topics receive a good score".
For example, the first row holds the data of document 1: the identification number is 1, the score under score criterion 1 is 0.66, and the score under score criterion 2 is 1.
Although the present embodiment does not show the ranking of the documents by score value, it can easily be obtained by sorting the score values.
In the present embodiment, the calculated document scores are recorded in the document information database in the table format shown in FIG. 19.
Next, the CPU 4 executes the result presentation process in step S13.
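Obtaining the rankings by sorting can be sketched with the score values quoted in the text. The dictionary layout, the key names `criterion_1` and `criterion_2`, and the function name `rank` are assumptions. Breaking ties by ascending document identifier is also an assumption, chosen because it reproduces the orders stated for both criteria.

```python
# Scores of documents 1 to 4 under the two criteria, as quoted in the text.
scores = {
    1: {"criterion_1": 0.66, "criterion_2": 1.0},
    2: {"criterion_1": 0.66, "criterion_2": 0.5},
    3: {"criterion_1": 2.0, "criterion_2": 0.66},
    4: {"criterion_1": 2.0, "criterion_2": 0.2},
}


def rank(score_table, criterion):
    """Return document ids ordered by descending score.
    Ties are broken by ascending document id (an assumption)."""
    return sorted(score_table,
                  key=lambda doc_id: (-score_table[doc_id][criterion], doc_id))


ranking_1 = rank(scores, "criterion_1")
ranking_2 = rank(scores, "criterion_2")
```

With these values, `ranking_1` is document 3, 4, 1, 2 and `ranking_2` is document 1, 3, 2, 4, matching the orders given for Formula 3 and Formula 4.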

<Document search processing>
The document search process of step S7 may be any process that can search for documents appropriately. For example, the Google search results mentioned earlier may be used.
For example, when the URLs of documents are obtained as the search results of the document search process, the HTML documents are retrieved with a known HTML document retrieval tool such as wget (see, for example, http://wget.sunsite.dk/), and the retrieved HTML documents are then converted to plain text with a known HTML-to-plain-text conversion tool such as html2text (see, for example, http://search.cpan.org/~awrigley/html2text-0.003/html2text.pl). The plain text of the retrieved search results is then registered in the document database.
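The fetch-and-convert step can be approximated with the Python standard library instead of wget and html2text. This is a swapped-in technique for illustration, not the tools named in the text. The names `TextExtractor`, `html_to_text`, and `fetch_plain_text` are hypothetical, and the UTF-8 default encoding is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Convert an HTML string to plain text, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_plain_text(url, encoding="utf-8"):
    """Download a search result page and return it as plain text.
    Assumes the page is encoded in `encoding`; undecodable bytes are replaced."""
    with urlopen(url) as response:
        return html_to_text(response.read().decode(encoding, errors="replace"))
```

The returned text would then be registered in the document database before morphological analysis.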

<Result presentation processing>
FIG. 20 shows a display example of the ranking of documents 1 to 4.
FIG. 20 shows the rankings (column direction) of documents 1 to 4 (row direction). "Query word match" is the ranking by matching frequency within each document for the query "postal privatization and summary" (郵政民営化 and 取りまとめ), "topic richness" is the ranking under score criterion 1, and "different content" is the ranking under score criterion 2. The ranking value increases from the center toward the left.
In the present embodiment, the ranking result for "query word match" is calculated using the full-text search system Namazu (http://www.namazu.org), which is not shown.
For example, "query word match" gives the same ranking value to documents 1 to 4, whereas "topic richness" ranks them in the order document 3, document 4, document 1, document 2, as described above. FIG. 20 therefore lets the user view rankings under multiple criteria at a glance, which helps the user find and browse the desired document.

FIGS. 21 to 23 show display examples when document 3 is selected in FIG. 20 under each of the three ranking criteria.
FIG. 21 shows document 3 under the "query word match" criterion. The parts that match the query words are highlighted as the important parts under this criterion. FIG. 22 shows document 3 under the "topic richness" criterion. The simple sentences contained in the nodes constituting the similar simple sentence concatenation set group are highlighted as the important parts under this criterion, that is, the common topic parts within the retrieved document group. FIG. 23 shows document 3 under the "different content" criterion. The simple sentences not contained in any node constituting the similar simple sentence concatenation set group are highlighted as the important parts under this criterion, that is, the parts other than the common topics within the retrieved document group. The user can thus also view the important parts of a document under each criterion.
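The selection of sentences to highlight under the "topic richness" and "different content" criteria can be sketched as follows. The function name `mark_important`, the criterion keys, and the `[[...]]` marker are all hypothetical. The actual display format of FIGS. 22 and 23 is not specified in the text.

```python
def mark_important(sentences, node_sentence_ids, criterion):
    """Wrap the important simple sentences of a document in [[...]].

    sentences: list of (sentence_id, text) in document order.
    node_sentence_ids: ids of simple sentences contained in some node.
    criterion: "topic_richness" highlights sentences inside nodes,
               "different_content" highlights sentences outside all nodes.
    """
    if criterion not in ("topic_richness", "different_content"):
        raise ValueError(f"unknown criterion: {criterion}")
    marked = []
    for sentence_id, text in sentences:
        in_common_topic = sentence_id in node_sentence_ids
        important = (in_common_topic if criterion == "topic_richness"
                     else not in_common_topic)
        marked.append(f"[[{text}]]" if important else text)
    return marked


document = [(1, "first"), (2, "second"), (3, "third")]
rich = mark_important(document, {1, 2}, "topic_richness")
different = mark_important(document, {1, 2}, "different_content")
```

The two criteria are complementary: every simple sentence is highlighted under exactly one of them.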

Since FIG. 22 uses the information of the similar simple sentence concatenation set group, displaying reference information to other documents in the same way enables more analytical browsing.
As a result, by providing a ranking method that can rank documents by criteria related to the volume and density of their topics, the present invention allows a user to browse and analyze a document group such as search results from multiple viewpoints.
A computer-readable recording medium such as a floppy disk or an optical disc may also be created on which a program for causing a computer to execute the functions of the information extraction device described above is recorded. By loading the recording medium into a floppy disk drive or an optical disc drive such as a CD-ROM reader of a general-purpose personal computer, reading the recorded program, and installing it on an internal storage device such as a hard disk, the computer can be made to function as the information extraction device according to the present invention.

A diagram showing an example of Google search results.
A diagram showing a configuration example of a computer that implements the information extraction device according to an embodiment of the present invention.
A flowchart showing the document ranking process executed by the information extraction device of the present embodiment.
A flowchart showing another process executed by the information extraction device of the present embodiment.
A diagram showing an example of the result of applying morphological analysis.
A diagram showing an example of the document information table.
A diagram showing an example of the morpheme list table for document identifier 1.
A diagram showing an example of the morpheme information table.
A diagram showing the simple sentences obtained by applying the rules to the morphological analysis result of document 1.
A diagram showing the result of extracting simple sentences from documents 2 to 4 as well.
A diagram showing the simple sentence end position list table for document identifier 1.
A diagram showing the result of applying the morpheme frequency vectors (not shown) of the simple sentences of FIG. 10 to Formula 2.
A diagram showing the valid row index list table and the column index-value list table for part of the simple sentence similarities of FIG. 12.
A diagram showing the simple sentence similarity matrix of FIG. 12 with elements whose similarity is 0.2 or more set to 1 and elements below 0.2 set to 0.
A diagram showing the processing flow for generating the similar simple sentence concatenation set group.
A diagram showing the extraction result of the three connected components.
A diagram showing the result of joining graph 1 and graph 2.
A diagram showing the similar simple sentence concatenation set group list table containing the information of graph 1' and graph 2.
A diagram showing the document score list table.
A diagram showing a display example of the ranking of documents 1 to 4.
A diagram showing a display example when document 3 is selected.
A diagram showing a display example when document 3 is selected.
A diagram showing a display example when document 3 is selected.

Explanation of symbols

2 ... keyboard, 3 ... communication I/O interface, 4 ... CPU, 5 ... memory, 6 ... hard disk, 7 ... display, 8 ... printer, 10 ... Internet, 11 ... document database

Claims

A document input step for entering a document;
A single sentence extraction step of extracting a single sentence from the document;
A single sentence similarity calculation step for calculating a similarity between single sentences extracted by the single sentence extraction step;
A similar single sentence concatenated set group extracting step for extracting a similar single sentence concatenated set group from the similarity relation of the single sentence and the appearance position information in the document of the single sentence;
A document score calculating step for calculating a score of a document based on the similar single sentence connected set group extracted by the similar single sentence connected set group extracting step;
A document ranking method characterized by comprising:

The document ranking method according to claim 1,
Further comprising a morphological analysis step of extracting morphemes from the document;
In the single sentence extraction step, single sentences are extracted based on the morpheme information analyzed in the morphological analysis step,
A document ranking method characterized in that, in the single sentence similarity calculation step, the similarity between single sentences is calculated based on the morpheme information analyzed in the morphological analysis step.

The document ranking method according to claim 2,
The document score calculating step calculates a score based on the importance of a similar single sentence concatenation set included in the document and a ratio including the similar single sentence concatenation set.

A document ranking method according to claim 3,
A document search step for searching for similar documents;
A search result presentation step for presenting a search result,
The search result presented in the search result presentation step is ranked according to the document ranking method.

A document input means for inputting a document;
Simple sentence extraction means for extracting a simple sentence from the document;
A single sentence similarity calculating means for calculating a similarity between single sentences extracted by the single sentence extracting means;
Similar single sentence concatenated set group extraction means for extracting a similar single sentence concatenated set group from the similarity relation of the single sentences and the appearance position information of the single sentences in the document;
Document score calculating means for calculating a score of a document based on the similar single sentence connected set group extracted by the similar single sentence connected set group extracting means;
A document ranking apparatus comprising:

The document ranking apparatus according to claim 5, wherein
Further comprising morpheme analysis means for extracting morphemes from the document;
A document ranking apparatus characterized in that the single sentence extracting means extracts single sentences based on the morpheme information analyzed by the morpheme analysis means, and the single sentence similarity calculating means calculates the similarity between single sentences based on the morpheme information analyzed by the morpheme analysis means.

The document ranking apparatus according to claim 6, wherein
A document ranking apparatus characterized in that the document score calculating means calculates a score based on the importance of a similar single sentence concatenation set included in the document and a ratio including the similar single sentence concatenation set.

The document ranking apparatus according to claim 7, wherein
Further comprising a document information database,
A document ranking apparatus characterized in that the identifier of the document input by the document input means, the morphological analysis result of the document extracted by the morpheme analysis means, the single sentence information of the document extracted by the single sentence extracting means, the similarity between single sentences calculated by the single sentence similarity calculating means, the similar single sentence connected set group extracted by the similar single sentence connected set group extracting means, and the document score calculated by the document score calculating means are stored in an appropriate format in the document information database.

A document ranking device according to claim 8,
A document retrieval device characterized by further comprising document search means for searching for similar documents and search result presentation means for presenting the search results, wherein the search results presented by the search result presentation means are ranked by the document ranking device.

A recording medium in which a program for executing the document ranking method according to any one of claims 1 to 3 is recorded in a computer-readable format.

A recording medium in which a program for executing the document search method according to claim 4 is recorded in a computer-readable format.