JP2003108571A

JP2003108571A - Document abstraction apparatus, control method of document abstraction apparatus, control program of document abstraction apparatus, and recording medium

Info

Publication number: JP2003108571A
Application number: JP2001304680A
Authority: JP
Inventors: Koji Yamada; 孝司山田
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-09-28
Filing date: 2001-09-28
Publication date: 2003-04-11

Abstract

(57) [Abstract]

[Problem] To create a more accurate summary adapted to each field, even when words that are important in that field occur only infrequently in the text.

[Solution] A sentence vector generation unit 14 generates a sentence vector for each sentence constituting the text to be summarized and outputs the sentence vectors to a text vector generation unit and an important sentence generation unit. A text vector generation unit 15 generates a text vector corresponding to the text to be summarized based on the sentence vectors and outputs it to a sentence vector comparison unit 16. The sentence vector comparison unit 16 extracts important sentences from the text to be summarized based on the sentence vectors, a field-specific feature vector, and the text vector. As a result, a summary output unit 13 generates a summary from the important sentences and outputs it.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a document summarizing device, a control method for a document summarizing device, a control program for a document summarizing device, and a recording medium, and more particularly to a document summarizing device, a control method, a control program, and a recording medium that make it possible to create an accurate summary easily.

[0002]

[Prior art] Various systems have been proposed that automatically create a summary, using an information processing device such as a computer, from an input document to be summarized. For example, the summarizing device disclosed in Japanese Patent No. 2944346 generates a feature vector for each sentence based on the feature vectors of the words in the document to be summarized, and creates a summary using the similarity between the sentences appearing in the text and the feature vectors. The document summarizing device disclosed in Japanese Patent Laid-Open No. 11-102372 calculates the similarity of context vectors and creates a summary based on the similarity obtained.

[0003]

[Problem to be solved by the invention] In the conventional techniques described above, the feature vector or context vector is created from frequency information on the words appearing in the text, and the summary is created from that vector. Consequently, even when the text contains a word that is important in the field to which the text belongs (for example, engineering, economics, machinery, electricity, or finance), that word is not selected for use in the summary if it happens to occur infrequently, and the sentence containing the word may in turn be left out of the summary. The result is an inaccurate summary. An object of the present invention is therefore to provide a document summarizing device, a control method for a document summarizing device, a control program for a document summarizing device, and a recording medium that can create a more accurate summary adapted to each field, even when words that are important in that field occur infrequently in the text.

[0004]

[Means for solving the problem] To solve the above problem, the document summarizing device comprises: a sentence vector generation unit that generates a sentence vector for each sentence constituting the text to be summarized; a text vector generation unit that generates, based on the sentence vectors, a text vector corresponding to the text to be summarized; and an important sentence extraction unit that extracts important sentences from the text to be summarized based on the sentence vectors, a field-specific feature vector characterizing the field to which the text to be summarized belongs, and the text vector. With this configuration, the sentence vector generation unit generates the sentence vectors of the sentences constituting the text to be summarized and outputs them to the text vector generation unit and the important sentence generation unit. The text vector generation unit generates the text vector corresponding to the text to be summarized based on the sentence vectors and outputs it to the important sentence generation unit. The important sentence generation unit extracts important sentences from the text to be summarized based on the sentence vectors, the field-specific feature vector, and the text vector.

[0005] In this case, the text vector generation unit may use, as the text vector, the average of the sentence vectors of the sentences constituting the text to be summarized. The important sentence extraction unit may calculate a first inner product, which is the inner product of the sentence vector and the text vector, calculate a second inner product, which is the inner product of the sentence vector and the field-specific feature vector, take the sum of the first and second inner products as the similarity, and extract the important sentences by comparing the similarity with a predetermined reference similarity. The device may further comprise a summary creation unit that creates a summary based on the extracted important sentences. The device may also comprise a field-specific feature vector generation unit that generates the field-specific feature vector for each field based on a plurality of learning texts. The field-specific feature vector generation unit may generate a text vector for each learning text and use the average of the text vectors of the plurality of learning texts as the field-specific feature vector.

[0006] The control method for the document summarizing device comprises: a sentence vector generation step of generating a sentence vector for each sentence constituting the text to be summarized; a text vector generation step of generating, based on the sentence vectors, a text vector corresponding to the text to be summarized; and an important sentence extraction step of extracting important sentences from the text to be summarized based on the sentence vectors, a field-specific feature vector characterizing the field to which the text to be summarized belongs, and the text vector. In this case, the text vector generation step may include a step of using, as the text vector, the average of the sentence vectors of the sentences constituting the text to be summarized.

[0007] Further, the important sentence extraction step may include: a step of calculating a first inner product, which is the inner product of the sentence vector and the text vector; a step of calculating a second inner product, which is the inner product of the sentence vector and the field-specific feature vector; a step of taking the sum of the first and second inner products as the similarity; and a step of extracting the important sentences by comparing the similarity with a predetermined reference similarity. The method may also include a summary creation step of creating a summary based on the extracted important sentences.

[0008] The control program for the document summarizing device, which causes a computer to function as a document summarizing device that generates summary data from input data of a text to be summarized, causes the computer to: generate a sentence vector for each sentence constituting the text to be summarized that corresponds to the input data; generate, based on the sentence vectors, a text vector corresponding to the text to be summarized; and extract important sentences from the text to be summarized based on the sentence vectors, a field-specific feature vector characterizing the field to which the text to be summarized belongs, and the text vector. In this case, the program may cause the computer to calculate the average of the sentence vectors of the sentences constituting the text to be summarized and use it as the text vector.

[0009] Further, the program may cause the computer to calculate a first inner product, which is the inner product of the sentence vector and the text vector, calculate a second inner product, which is the inner product of the sentence vector and the field-specific feature vector, take the sum of the first and second inner products as the similarity, and extract the important sentences by comparing the similarity with a predetermined reference similarity. The program may also cause the computer to create a summary based on the extracted important sentences. Any of the above control programs for the document summarizing device may be recorded on a recording medium.

[0010]

[Embodiments of the invention] Preferred embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram outlining the functional configuration of the document summarizing device of the embodiment. The document summarizing device 10 broadly comprises a morphological analysis unit 11, a text analysis unit 12, a summary output unit 13, a sentence vector generation unit 14, a text vector generation unit 15, a field-specific feature vector generation unit 17, a sentence vector comparison unit 16, and a field-specific feature vector dictionary 18. The document summarizing device 10 can be implemented on a computer system, and the functions of the morphological analysis unit 11, the text analysis unit 12, the summary output unit 13, the sentence vector generation unit 14, the text vector generation unit 15, the sentence vector comparison unit 16, and the field-specific feature vector generation unit 17 are each realized by a corresponding program executable on a microprocessor. Such a program may be executed directly from a recording medium such as a semiconductor memory or a CD-ROM. The program may also be installed on an external storage device in advance and executed from there. Furthermore, the program may be installed over a network such as the Internet, either each time before it is executed or only once at the outset.

[0011] The text to be summarized may be input directly with a keyboard, tablet, or the like, read from a storage device such as a flexible disk or hard disk, captured by character recognition using a scanner and OCR, or received over a network such as the Internet or a LAN. The resulting summary may be output to a display device such as a CRT, printed on a printer or the like, recorded on a storage device such as a flexible disk or hard disk, or transmitted over a network such as a LAN. In practice, the field-specific feature vector dictionary is provided as a database, held either on a storage device such as a hard disk connected to the computer system or on a storage device such as a hard disk connected to a database server reached over a network such as the Internet or a LAN. The morphological analysis unit 11 performs morphological analysis on the input text to be summarized, splits it into words, and extracts the words needed to generate sentence vectors, such as nouns and adjectival nouns.

[0012] The text analysis unit 12 controls the sentence vector generation unit 14, the text vector generation unit 15, and the sentence vector comparison unit 16 so that, of the sentences constituting the input text to be summarized, those considered highly important in the field to which the text belongs (hereinafter, "important sentences") are output to the summary output unit 13. The sentence vector generation unit 14 vectorizes each sentence constituting the text to be summarized and outputs the resulting sentence vectors to the text vector generation unit 15. The text vector generation unit 15 generates the text vector of the document to be summarized based on the input sentence vectors. The sentence vector comparison unit 16 functions as the important sentence extraction unit: based on the input sentence vectors, the generated text vector, and the feature vector of the field to which the document to be summarized belongs, read from the field-specific feature vector dictionary, it extracts the important sentences in that field from the sentences constituting the input text and outputs them to the summary output unit 13.

[0013] The field-specific feature vector generation unit 17 generates the field-specific feature vectors to be stored in the field-specific feature vector dictionary 18 before the document summarizing device 10 is put into operation. The operator of the document summarizing device 10 prepares, for each field, a predetermined number of learning texts, that is, texts known to belong to a field of the texts that will be summarized with the device. The number of texts to prepare is determined empirically and should be large enough for the characteristics of each field to emerge with statistical significance. A text vector is then generated from each learning text. One concrete vectorization method is to use a vector whose dimensions are the words that occur. The value of each vector element is the TFIDF value (a value calculated from the frequency of occurrence of the word and the feature value of the word in the text).

[0014] For example, the vector D1(d1) of a text d1 in a database storing a plurality of texts can be expressed as follows.

D1(d1) = (TF(d1,t1)*IDF(t1), TF(d1,t2)*IDF(t2), TF(d1,t3)*IDF(t3), ..., TF(d1,tn)*IDF(tn))

Here t1, t2, t3, ..., tn are the words that occur in the database, and t1 to tn together cover every word in the database. For each field, the average of the learning text vectors is then calculated, for example, and the resulting average vector is stored in the field-specific feature vector dictionary 18 as the field-specific feature vector of that field. Specifically, as shown in FIG. 2, a plurality of text vectors are obtained for the learning texts belonging to field A, and likewise a plurality of text vectors are obtained for the learning texts belonging to field B. To simplify the drawing, FIG. 2 shows the text vectors of each field in two dimensions, each drawn as a rectangle containing the code (A or B) of the field to which the text vector belongs. In reality, each text vector is an N-dimensional vector whose first axis is the TFIDF value of the first word, whose second axis is the TFIDF value of the second word, and so on up to the Nth axis, which is the TFIDF value of the Nth word. Here N is the number of words that occur in all the target documents. The text vectors obtained are averaged for each field, and the resulting average vector is used as the field-specific feature vector of that field. FIG. 2 shows the field-specific feature vector VCA of field A and the field-specific feature vector VCB of field B.
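The averaging described in this paragraph can be sketched in Python as follows. This is only a minimal illustration: the function names, the three-word vocabulary, and the precomputed `idf` table are assumptions made for the example and do not come from the specification.

```python
from collections import Counter


def text_vector(words, vocabulary, idf):
    """Return (TF(d,t1)*IDF(t1), ..., TF(d,tn)*IDF(tn)) for one text."""
    tf = Counter(words)  # TF(d, t): number of occurrences of t in the text
    return [tf[t] * idf[t] for t in vocabulary]


def field_feature_vector(learning_texts, vocabulary, idf):
    """Average the text vectors of all learning texts of one field."""
    vectors = [text_vector(words, vocabulary, idf) for words in learning_texts]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy data: each text is already split into words.
vocabulary = ["pc", "software", "mail"]
idf = {"pc": 1.0, "software": 2.0, "mail": 0.5}  # assumed, not computed here
field_a_texts = [["pc", "pc", "software"], ["pc", "mail"]]

vca = field_feature_vector(field_a_texts, vocabulary, idf)
print(vca)
```

With these toy inputs the two text vectors are (2, 2, 0) and (1, 0, 0.5), so their average plays the role of VCA for the hypothetical field A.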

[0015] When the feature vector of each field is actually stored as a database in the field-specific feature vector dictionary 18, the database is organized as shown in FIG. 3: for each field it contains field ID data identifying the field and n items of word TFIDF data representing the field, arranged in descending order of TFIDF value (descending order of frequency of occurrence). More specifically, for the field with field ID data = 1, n items of word TFIDF data are stored in order, starting with the TFIDF data for "personal computer", the word with the highest TFIDF value in that field, with its TFIDF value of 0.0001, and ending with the word "software". Similarly, for the field with field ID data = 2, n items of word TFIDF data are stored in order, starting with the TFIDF data for "mail", the word with the highest TFIDF value in that field, with its TFIDF value of 0.00015, and ending with the word "send".
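One way to picture the dictionary layout described above is a mapping from field ID to the n highest-valued (word, TFIDF) pairs. The sketch below is hypothetical: the function name, the vocabulary, and every value other than the top entry 0.0001 for field 1 are invented for illustration, and the real database schema of FIG. 3 is not specified beyond what the text says.

```python
def build_dictionary_entry(vocabulary, feature_vector, n):
    """Keep the n words with the largest TFIDF values, in descending order."""
    ranked = sorted(zip(vocabulary, feature_vector),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical feature vector for field ID 1; only 0.0001 appears in the text.
vocabulary = ["software", "personal computer", "mail"]
field_vectors = {
    1: [0.00004, 0.0001, 0.00001],
}

dictionary = {
    field_id: build_dictionary_entry(vocabulary, vector, n=2)
    for field_id, vector in field_vectors.items()
}
print(dictionary[1])
```

Truncating to n entries keeps the stored record small, at the cost of treating every word outside the top n as having a zero component when the vector is read back.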

[0016] The operation of the document summarizing device is described next, focusing on the text analysis unit, with reference to the processing flowchart of the text analysis unit in FIG. 4. The morphological analysis unit 11 performs morphological analysis on the input text to be summarized, splits it into words, extracts the words needed to generate sentence vectors, such as nouns and adjectival nouns, and outputs them to the text analysis unit 12. The text analysis unit 12 takes out a sentence from the input text to be summarized (step S11). It then determines whether all the sentences constituting the text to be summarized have been taken out (step S12). If not all the sentences have been taken out yet (step S12; No), it sends the sentence just taken out to the sentence vector generation unit 14, which generates a sentence vector.

[0017] The sentence vector generation operation of the sentence vector generation unit 14 is described here with reference to the processing flowchart of the sentence vector generation unit in FIG. 5. The sentence vector generation unit 14 acquires the nouns, adjectival nouns, and other words needed for sentence vector generation that were supplied by the morphological analysis unit 11 (step S21). It then calculates the TFIDF of each word obtained by the morphological analysis and vectorizes the sentence (step S22). The TFIDF is calculated with the following formula.

TFIDF = TF(d,t) × IDF(t)

where
TF(d,t): frequency of occurrence of word t in text d
IDF(t): log[DB(db) / f(t,db)]
DB(db): total number of texts stored in a database db
f(t,db): number of texts in database db in which word t occurs
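The TFIDF formula above translates almost directly into code. The following Python sketch assumes that each text is a list of already-segmented words, that the logarithm is the natural logarithm (the specification does not state the base), and that a word absent from the whole database gets an IDF of 0; all three are assumptions of this example, as are the function names and the toy database.

```python
import math


def tf(text, term):
    """TF(d,t): number of times term occurs in the text (a list of words)."""
    return text.count(term)


def idf(term, database):
    """IDF(t) = log[DB(db) / f(t,db)] over a database of texts."""
    texts_with_term = sum(1 for text in database if term in text)  # f(t,db)
    if texts_with_term == 0:
        return 0.0  # assumption: the formula is undefined for unseen words
    return math.log(len(database) / texts_with_term)  # natural log assumed


def tfidf(text, term, database):
    """TFIDF = TF(d,t) * IDF(t)."""
    return tf(text, term) * idf(term, database)


# Hypothetical four-text database.
database = [
    ["moon", "autumn"],
    ["moon", "night"],
    ["spring", "flower"],
    ["autumn", "wind"],
]
value = tfidf(database[0], "moon", database)
print(value)
```

Here "moon" occurs once in the first text and in two of the four texts, so the value is 1 × log(4/2).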

[0018] As a concrete example of the calculation, consider computing the TFIDF value of the word t = "月" (tsuki, "moon") for text d, the text of poem No. 23 in the Hyakunin Isshu anthology.

Text d = 「月みればちぢに物こそかなしけれわが身ひとつの秋にはあらねど」

In the text d of poem No. 23, the frequency of occurrence TF(d,t) of the word t = "月" is 1. The total number of poems in the Hyakunin Isshu (the total number of texts DB(db)) is 100, and the number of poems in the whole anthology that contain the word t = "月" (f(t,db)) is 11. Substituting these values into the formula in paragraph [0017], the TFIDF value of the word t = "月" for text d is therefore:

TFIDF = TF(d,t) × IDF(t) = 1 × log(100 / 11)
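Plugging the figures from this example (TF = 1, DB = 100, f = 11) into the formula of paragraph [0017] gives 1 × log(100/11). The short Python check below evaluates that expression numerically. Because the specification does not say which logarithm base is intended, both the natural and the base-10 logarithm are shown; the variable names are illustrative only.

```python
import math

term_frequency = 1      # TF(d,t): occurrences of the word in poem No. 23
total_texts = 100       # DB(db): poems in the Hyakunin Isshu
texts_with_term = 11    # f(t,db): poems containing the word

ratio = total_texts / texts_with_term

tfidf_natural = term_frequency * math.log(ratio)    # if log means ln
tfidf_base10 = term_frequency * math.log10(ratio)   # if log means log10

print(round(tfidf_natural, 4))
print(round(tfidf_base10, 4))
```

The choice of base only rescales every TFIDF value by the same constant, so it does not change which sentences rank highest, although it does change the appropriate value of any fixed threshold.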

[0019] If the determination in step S12 finds that all the sentences constituting the text to be summarized have been taken out (step S12; Yes), the text analysis unit 12 sends all the sentence vectors obtained in step S13 to the text vector generation unit 15, which generates the text vector (step S14). The text vector generation operation is described here with reference to the processing flowchart of the text vector generation unit 15 in FIG. 6. The text vector generation unit 15 calculates the average of the sentence vectors generated by the sentence vector generation unit 14 (step S31). It then outputs this average (the average vector of the sentence vectors) to the text analysis unit 12 as the text vector (step S32). Next, the text analysis unit 12 sends the sentence vectors obtained in step S13 and the text vector obtained in step S14 to the sentence vector comparison unit 16, which extracts the important sentences (step S15).
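Steps S31 and S32 amount to a component-wise mean over the sentence vectors. A minimal Python sketch follows, assuming the sentence vectors are dense lists of equal length; the function name and the sample numbers are hypothetical.

```python
def text_vector_from_sentences(sentence_vectors):
    """Step S31: component-wise average of the sentence vectors."""
    if not sentence_vectors:
        raise ValueError("at least one sentence vector is required")
    count = len(sentence_vectors)
    return [sum(components) / count for components in zip(*sentence_vectors)]


# Two hypothetical sentence vectors over a three-word vocabulary.
sentence_vectors = [
    [2.0, 0.0, 1.0],
    [0.0, 4.0, 1.0],
]
text_vector = text_vector_from_sentences(sentence_vectors)  # step S32 output
print(text_vector)
```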

[0020] The operation of extracting the important sentences used to create the summary is described here with reference to the processing flowchart of the sentence vector comparison unit 16 in FIG. 7. The sentence vector comparison unit 16 attempts to take out, in order, one sentence vector from those sent by the text analysis unit 12 (step S41). It then determines whether there were no more sentence vectors to take out in step S41 (step S42). If no sentence vectors remain (step S42; Yes), all important sentence extraction is complete and the sentence vector comparison unit 16 ends the processing.

[0021] If the determination in step S42 finds that there was a sentence vector to take out (step S42; No), the sentence vector comparison unit 16 calculates the similarity based on the sentence vector taken out, the feature vector corresponding to the field, designated in advance by the user, to which the text to be summarized belongs, and the text vector of the text to be summarized (step S43). The similarity is calculated as follows. First, the inner product of the sentence vector and the text vector (the first inner product) is obtained. Next, the inner product of the sentence vector and the feature vector corresponding to the field (the second inner product) is obtained. The sum of the first inner product and the second inner product is then taken as the similarity.
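The similarity of step S43 can be sketched in Python as the sum of two dot products. The function names and sample vectors below are assumptions made for illustration; the vectors are taken to be dense lists over a shared vocabulary.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity(sentence_vector, text_vector, field_vector):
    """Step S43: first inner product plus second inner product."""
    first = dot(sentence_vector, text_vector)    # sentence vs. whole text
    second = dot(sentence_vector, field_vector)  # sentence vs. field
    return first + second


# Hypothetical vectors over a three-word vocabulary.
sentence_vector = [1.0, 0.0, 2.0]
text_vector = [0.5, 1.0, 1.0]
field_vector = [0.0, 3.0, 2.0]

score = similarity(sentence_vector, text_vector, field_vector)
print(score)
```

Because the inner product is linear, this sum equals the inner product of the sentence vector with the sum of the text vector and the field vector. A word that is rare in the text but heavily weighted in the field vector can therefore still raise a sentence's score, which is the effect the invention aims for.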

【００２２】続いて、文ベクトル比較部１６は、類似度
があらかじめ設定された一定値（重要文か否かを定める
ための基準類似度に相当する値）以上であるか否かを判
別する（ステップＳ４４）。ステップＳ４４の判別にお
いて、類似度が一定値未満である場合には（ステップＳ
４４；Ｎｏ）、ステップＳ４３において類似度の算出に
用いた文ベクトルに対応する文章は、要約作成における
重要文ではないので、処理をステップＳ４１に移行して
全ての文ベクトルに対する処理が終了するまで以下同様
の処理を行う。ステップＳ４４の判別において、類似度
が一定値以上である場合には（ステップＳ４４；Ｙｅ
ｓ）、ステップＳ４３において類似度の算出に用いた文
ベクトルに対応する文章は要約作成に用いるべき重要文
として抽出する（ステップＳ４５）。そして処理を再び
ステップＳ４１に移行し、全ての文ベクトルに対する処
理が終了するまで以下同様の処理を繰り返す。これによ
り文章解析部１２は、文ベクトル比較部１６が抽出した
重要分を要約文出力部１３に送出する。これにより要約
文出力部は送信された重要文を接続し、要約文を生成し
て出力して、処理を終了する（ステップＳ１６）。要約
文の具体例については、以下に詳述する。Next, the sentence vector comparison unit 16 determines whether or not the similarity is equal to or greater than a preset fixed value (a value corresponding to the reference similarity that decides whether or not a sentence is an important sentence) (step S44). If the determination in step S44 finds that the similarity is less than the fixed value (step S44; No), the sentence corresponding to the sentence vector used for the similarity calculation in step S43 is not an important sentence for creating the summary, so the process returns to step S41 and the same processing is repeated until all sentence vectors have been processed. If the determination in step S44 finds that the similarity is equal to or greater than the fixed value (step S44; Yes), the sentence corresponding to the sentence vector used for the similarity calculation in step S43 is extracted as an important sentence to be used for creating the summary (step S45). The process then returns to step S41, and the same processing is repeated until all sentence vectors have been processed. The sentence analysis unit 12 then sends the important sentences extracted by the sentence vector comparison unit 16 to the summary sentence output unit 13. The summary sentence output unit 13 connects the received important sentences, generates and outputs the summary, and ends the processing (step S16). A specific example of the summary is described in detail below.
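
The loop over steps S41 to S45 and the final concatenation in step S16 can be sketched as below. This is an assumption-laden illustration, not the device's implementation: the names `extract_important` and `build_summary`, the space used to join sentences, and the similarity values are all hypothetical.

```python
def extract_important(sentences, similarities, threshold):
    """Steps S41-S45: keep, in original order, every sentence whose
    similarity is equal to or greater than the reference similarity."""
    if len(sentences) != len(similarities):
        raise ValueError("one similarity value is required per sentence")
    return [s for s, sim in zip(sentences, similarities) if sim >= threshold]


def build_summary(important_sentences):
    """Step S16: connect the important sentences into one summary.
    Joining with a space is an assumption made for this sketch."""
    return " ".join(important_sentences)


# Made-up sentences and similarity values.
sentences = ["Sentence A.", "Sentence B.", "Sentence C."]
similarities = [0.9, 0.2, 0.5]
important = extract_important(sentences, similarities, threshold=0.5)
summary = build_summary(important)
```

A sentence whose similarity equals the threshold is kept, matching the "equal to or greater than" condition of step S44.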

【００２３】次に上記実施形態の手法により得られる要
約文と、従来の文ベクトルの類似度を用いて抽出された
重要文を用いて得られる要約文と、を比較する。以下の
説明においては、特許第２９４４３４６号の明細書に開
示されている従来の技術の部分を取り出したものを要約
対象文章として用いるものとする。この場合において、
元の文章に対し、説明の容易化のため、文番号を付加し
ている。要約対象文章は以下の通りである。「１：発想とは既知の情報の新たな組み合わせであ
り、決して無から有を作り出すことはできない。２：そのために、文書作成時における発想に際しては、
既存の文書を参照して引用することが頻繁に行われる。３：一般に、参考とする既存の文書はその数も多く、個
々の文書中における文章量も多い。４：したがって、この参考とする既存の文書をそのまま
全部読んでいては時間や労力を消費してしまい、本来の
目的である文書作成にかける力が減少してしまう。５：参考とする文書の多さについては、検索装置を用い
て文書内容を絞り込むことによって減らすことができ
る。６：また、個々の文書中における文章量の多さについて
は、要約/要旨抽出装置を用いることによって減少でき
る。７：ここで、個々の文書の文章量を減少させることによ
って参照の手間を軽減するために、文書から要約/要旨
抽出を抽出する場合を考える。８：この場合には、文書の文章量を減少させても元の文
書に含まれる重要な内容が損なわれないような手法を用
いる必要がある。９：従来から提唱されている文書要約の手法としては、
次の２つの手法がある。１０：第１の手法は、文章を表層的に解析するものであ
る。１１：この手法には、単語の出現頻度解析から文章の重
要箇所を決定して元の文書に含まれている単語の組み合
わせや文の抽出によって要約文の生成を行うものや、文
の文末表現および用言によって文章中における強調/主
張文を抽出するものが含まれる。１２：第２の手法は、文章を意味的に解析するものであ
る。１３：この手法には、事前に文章の形式や文脈を仮定し
ておいてその仮定に沿って文章を解析して要約を抽出す
るものや、文の係り受けの粗密性を用いることによって
内容の重要性を定義して要約を抽出するものが含まれ
る。」上記要約対象文章に対して、従来の文ベクトルの類似度
を用いて抽出された重要文は以下の文番号７，８，１３
の３文となる。「７：ここで、個々の文書の文章量を減少させること
によって参照の手間を軽減するために、文書から要約/
要旨抽出を抽出する場合を考える。」「８：この場合には、文書の文章量を減少させても元
の文書に含まれる重要な内容が損なわれないような手法
を用いる必要がある。」「１３：この手法には、事前に文章の形式や文脈を仮
定しておいてその仮定に沿って文章を解析して要約を抽
出するものや、文の係り受けの粗密性を用いることによ
って内容の重要性を定義して要約を抽出するものが含ま
れる。」これに対し、本実施形態の分野別の特徴ベクトルを用い
た手法で抽出された重要文は以下の文番号８，１１，１
３の３文となる。「８：この場合には、文書の文章量を減少させても元
の文書に含まれる重要な内容が損なわれないような手法
を用いる必要がある。」「１１：この手法には、単語の出現頻度解析から文章
の重要箇所を決定して元の文書に含まれている単語の組
み合わせや文の抽出によって要約文の生成を行うもの
や、文の文末表現および用言によって文章中における強
調/主張文を抽出するものが含まれる。」「１３：この手法には、事前に文章の形式や文脈を仮
定しておいてその仮定に沿って文章を解析して要約を抽
出するものや、文の係り受けの粗密性を用いることによ
って内容の重要性を定義して要約を抽出するものが含ま
れる。」Next, the summary obtained by the method of the above embodiment is compared with the summary obtained from important sentences extracted using the conventional similarity between sentence vectors. In the following description, the prior-art section disclosed in the specification of Japanese Patent No. 2944346 is used as the text to be summarized. Sentence numbers have been added to the original text to make the explanation easier. The text to be summarized is as follows.
"1: An idea is a new combination of known information; something can never be created from nothing.
2: For this reason, when developing ideas while writing a document, existing documents are frequently consulted and quoted.
3: In general, there are many existing documents to consult, and each document contains a large amount of text.
4: Therefore, reading all of these reference documents in full consumes time and labor, and reduces the effort available for the original purpose, which is writing the document.
5: The number of documents to consult can be reduced by narrowing down the document contents with a search device.
6: The large amount of text in each document can be reduced by using a summary/abstract extraction device.
7: Here, consider the case of extracting a summary/abstract from a document in order to reduce the effort of reference by reducing the amount of text in each document.
8: In this case, it is necessary to use a method that does not lose the important content of the original document even when the amount of text is reduced.
9: Two methods of document summarization have conventionally been proposed.
10: The first method analyzes the text at the surface level.
11: This method includes approaches that determine the important parts of a text from word frequency analysis and generate a summary by combining words contained in the original document or by extracting sentences, and approaches that extract emphasized/asserted sentences from the text based on sentence-final expressions and predicates.
12: The second method analyzes the text semantically.
13: This method includes approaches that assume the format and context of the text in advance and analyze the text according to that assumption to extract a summary, and approaches that define the importance of content by using the density of dependency relations within sentences and extract a summary."
For the above text, the important sentences extracted using the conventional similarity between sentence vectors are the three sentences numbered 7, 8, and 13.
"7: Here, consider the case of extracting a summary/abstract from a document in order to reduce the effort of reference by reducing the amount of text in each document."
"8: In this case, it is necessary to use a method that does not lose the important content of the original document even when the amount of text is reduced."
"13: This method includes approaches that assume the format and context of the text in advance and analyze the text according to that assumption to extract a summary, and approaches that define the importance of content by using the density of dependency relations within sentences and extract a summary."
In contrast, the important sentences extracted by the method of this embodiment, which uses field-specific feature vectors, are the three sentences numbered 8, 11, and 13.
"8: In this case, it is necessary to use a method that does not lose the important content of the original document even when the amount of text is reduced."
"11: This method includes approaches that determine the important parts of a text from word frequency analysis and generate a summary by combining words contained in the original document or by extracting sentences, and approaches that extract emphasized/asserted sentences from the text based on sentence-final expressions and predicates."
"13: This method includes approaches that assume the format and context of the text in advance and analyze the text according to that assumption to extract a summary, and approaches that define the importance of content by using the density of dependency relations within sentences and extract a summary."

【００２４】ここで、各手法により得られる重要文の差
異について説明する。上記従来の手法では、"出現頻
度"、"強調"、"生成"などの単語の出現頻度が文書全体
では少ないため、文番号１１の文が重要文として認識さ
れない。従って、従来の手法で得られる要約文は、以下
のようになる。「ここで、個々の文書の文章量を減少さ
せることによって参照の手間を軽減するために、文書か
ら要約/要旨抽出を抽出する場合を考える。この場合に
は、文書の文章量を減少させても元の文書に含まれる重
要な内容が損なわれないような手法を用いる必要があ
る。この手法には、事前に文章の形式や文脈を仮定して
おいてその仮定に沿って文章を解析して要約を抽出する
ものや、文の係り受けの粗密性を用いることによって内
容の重要性を定義して要約を抽出するものが含まれ
る。」Here, the differences between the important sentences obtained by each method are explained. In the conventional method, words such as "appearance frequency", "emphasis", and "generation" appear only rarely in the document as a whole, so sentence 11 is not recognized as an important sentence. The summary obtained by the conventional method is therefore as follows. "Here, consider the case of extracting a summary/abstract from a document in order to reduce the effort of reference by reducing the amount of text in each document. In this case, it is necessary to use a method that does not lose the important content of the original document even when the amount of text is reduced. This method includes approaches that assume the format and context of the text in advance and analyze the text according to that assumption to extract a summary, and approaches that define the importance of content by using the density of dependency relations within sentences and extract a summary."

【００２５】これに対し、本実施形態の手法では、あら
かじめ、ベクトル空間法を使用した自然言語処理の特許
明細書データを利用して、ベクトル空間法を利用した自
然言語処理という分野の特徴ベクトルを作成している。
従って、分野別特徴ベクトル辞書１８には、この自然言
語処理分野の特徴ベクトルとして、ベクトル空間法を利
用した自然言語処理でよく出現する「出現頻度」、「強
調」、「生成」などの単語に対応する単語ＴＦＩＤＦデ
ータも含まれている。この結果、文番号７の文は、当該
自然言語処理分野においては、重要文としては取り扱わ
れなくなり、これに代わって文番号１１の文が重要文と
して認識されるようになるのである。In contrast, in the method of this embodiment, a feature vector for the field of natural language processing using the vector space method is created in advance from the data of patent specifications on natural language processing using the vector space method.
Therefore, the field-specific feature vector dictionary 18 also contains, as the feature vector for this natural language processing field, word TFIDF data corresponding to words such as "appearance frequency", "emphasis", and "generation" that frequently appear in natural language processing using the vector space method. As a result, sentence 7 is no longer treated as an important sentence in this natural language processing field, and sentence 11 is recognized as an important sentence in its place.
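
Claim 6 describes the field-specific feature vector as the average of the vectors of the learning texts for a field. A minimal sketch of that averaging follows, assuming each learning text has already been converted to a TFIDF vector over a shared vocabulary; the function name `average_vector` and the numbers are hypothetical.

```python
def average_vector(vectors):
    """Component-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


# Made-up TFIDF vectors of two learning texts from one field.
learning_text_vectors = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 2.0],
]
field_vec = average_vector(learning_text_vectors)
```

A term that is frequent across the learning texts (the third component here) keeps a large weight in the field vector, which is what lets it raise the similarity of sentences containing that term even when the term is rare in the text being summarized.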

【００２６】この結果、要約文出力部１３から出力され
る要約文は、以下のようになる。「この場合には、文
書の文章量を減少させても元の文書に含まれる重要な内
容が損なわれないような手法を用いる必要がある。この
手法には、単語の出現頻度解析から文章の重要箇所を決
定して元の文書に含まれている単語の組み合わせや文の
抽出によって要約文の生成を行うものや、文の文末表現
および用言によって文章中における強調/主張文を抽出
するものが含まれる。この手法には、事前に文章の形式
や文脈を仮定しておいてその仮定に沿って文章を解析し
て要約を抽出するものや、文の係り受けの粗密性を用い
ることによって内容の重要性を定義して要約を抽出する
ものが含まれる。」As a result, the summary output from the summary sentence output unit 13 is as follows. "In this case, it is necessary to use a method that does not lose the important content of the original document even when the amount of text is reduced. This method includes approaches that determine the important parts of a text from word frequency analysis and generate a summary by combining words contained in the original document or by extracting sentences, and approaches that extract emphasized/asserted sentences from the text based on sentence-final expressions and predicates. This method includes approaches that assume the format and context of the text in advance and analyze the text according to that assumption to extract a summary, and approaches that define the importance of content by using the density of dependency relations within sentences and extract a summary."

【００２７】以上の説明のように、本実施形態によれ
ば、要約対象の文章が属する分野に特徴的な単語までも
考慮して要約文を作成するため、より正確な要約文を容
易に作成することが可能となる。As described above, according to the present embodiment, the summary is created taking into account even the words characteristic of the field to which the text to be summarized belongs, so a more accurate summary can be created easily.

【００２８】以下、本実施形態の変形例について説明す
る。以上の説明においては、文章の類似度として文章ベ
クトルの内積を用いる場合について説明したが、文章ベ
クトルのユークリッド距離を文章の類似度として用いる
ように構成することも可能である。上述したように、複
数の文章が格納されたデータベース内の文章ｄ１のベク
トルＤ1は次の式で表現できる。 D1(d1)＝(TF(d1,t1)*IDF(t1), TF(d1,t2)*IDF(t2), TF(d1,t3)*IDF(t3), ……,TF(d1,tn)*IDF(tn))A modification of this embodiment is described below. The above description dealt with the case where the inner product of text vectors is used as the similarity between texts, but the Euclidean distance between text vectors can also be used as the similarity. As described above, the vector D1 of a text d1 in a database storing a plurality of texts can be expressed by the following equation. D1(d1) = (TF(d1,t1)*IDF(t1), TF(d1,t2)*IDF(t2), TF(d1,t3)*IDF(t3), ……, TF(d1,tn)*IDF(tn))

【００２９】同様に、文章ｄ２のベクトルＤ2は次の式
で表現できる。 D2(d2)＝(TF(d2,t1)*IDF(t1), TF(d2,t2)*IDF(t2), TF(d2,t3)*IDF(t3), ……, TF(d2,tn)*IDF(tn)) ここで、t1、t2、t3、……、tnは、それぞれデータベー
ス内に出現する単語であり、t1〜tnは、データベース内
の全単語に相当する。これに基づき、文章ｄ１に対応す
る文書ベクトルＤ１（ｄ１）と、文章ｄ２に対応する文
書ベクトルＤ２（ｄ２）の類似度であるユークリッド距
離ＤＥは次の式で計算できる。ＤＥ＝|D1(d1)-D2(d2)|Similarly, the vector D2 of a text d2 can be expressed by the following equation. D2(d2) = (TF(d2,t1)*IDF(t1), TF(d2,t2)*IDF(t2), TF(d2,t3)*IDF(t3), ……, TF(d2,tn)*IDF(tn)) Here, t1, t2, t3, ……, tn are words that appear in the database, and t1 to tn together correspond to all the words in the database. On this basis, the Euclidean distance DE, which serves as the similarity between the document vector D1(d1) corresponding to text d1 and the document vector D2(d2) corresponding to text d2, can be calculated by the following equation. DE = |D1(d1) - D2(d2)|
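
The two vectors and the distance DE can be sketched as follows. This is an illustration under stated assumptions: TF is taken to be the raw count of a word in the text, the IDF weights are made-up numbers supplied directly rather than computed from a database, and the names `tfidf_vector` and `euclidean_distance` are hypothetical.

```python
import math


def tfidf_vector(text_tokens, vocabulary, idf):
    """Build (TF(d,t1)*IDF(t1), ..., TF(d,tn)*IDF(tn)).
    TF is assumed here to be the raw count of the word in the text."""
    return [text_tokens.count(t) * idf[t] for t in vocabulary]


def euclidean_distance(d1, d2):
    """DE = |D1 - D2|, the Euclidean norm of the difference."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))


# Hypothetical vocabulary t1..t3 and made-up IDF weights.
vocabulary = ["vector", "summary", "word"]
idf = {"vector": 1.0, "summary": 2.0, "word": 0.5}

D1 = tfidf_vector(["vector", "summary", "summary"], vocabulary, idf)
D2 = tfidf_vector(["vector", "word", "word"], vocabulary, idf)
DE = euclidean_distance(D1, D2)
```

Unlike an inner product, a distance shrinks as two vectors become more alike, so a comparison against a reference value would presumably need to be reversed when DE is used; the specification does not spell this out.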

【００３０】[0030]

【発明の効果】本発明によれば、要約対象の文章につい
て要約文を自動的に生成するに際し、当該要約対象の文
章が属する分野も考慮して要約文を生成するため、より
正確な要約文を容易に作成することが可能となる。According to the present invention, when a summary is automatically generated for a text to be summarized, the summary is generated taking into account the field to which that text belongs, so a more accurate summary can be created easily.

[Brief description of drawings]

【図１】実施形態の文書要約装置の概要機能構成ブロ
ック図である。FIG. 1 is a schematic functional configuration block diagram of a document summarizing device according to an embodiment.

【図２】各分野に属する複数の学習用文章に対応する
複数の文章ベクトルの説明図である。FIG. 2 is an explanatory diagram of a plurality of sentence vectors corresponding to a plurality of learning sentences belonging to each field.

【図３】分野別特徴ベクトル辞書内のデータベースの
説明図である。FIG. 3 is an explanatory diagram of a database in a field-specific feature vector dictionary.

【図４】文章解析部の処理フローチャートである。FIG. 4 is a processing flowchart of a sentence analysis unit.

【図５】文スペクトル生成部の処理フローチャートで
ある。FIG. 5 is a processing flowchart of the sentence vector generation unit.

【図６】文章ベクトル生成部の処理フローチャートで
ある。FIG. 6 is a processing flowchart of the text vector generation unit.

【図７】文ベクトル比較部の処理フローチャートであ
る。FIG. 7 is a processing flowchart of a sentence vector comparison unit.

[Explanation of symbols]

１０……文書要約装置１１……形態素解析部１２……文章解析部１３……要約文出力部１４……文ベクトル生成部１５……文章ベクトル生成部１６……文ベクトル比較部（重要文抽出部）１７……分野別特徴ベクトル生成部１８……分野別特徴ベクトル辞書
10 ... Document summarizing device
11 ... Morphological analysis unit
12 ... Sentence analysis unit
13 ... Summary sentence output unit
14 ... Sentence vector generation unit
15 ... Text vector generation unit
16 ... Sentence vector comparison unit (important sentence extraction unit)
17 ... Field-specific feature vector generation unit
18 ... Field-specific feature vector dictionary

Claims

[Claims]

1. A document summarizing device comprising: a sentence vector generation unit that generates a sentence vector for each sentence constituting a text to be summarized; a text vector generation unit that generates, based on the sentence vectors, a text vector corresponding to the text to be summarized; and an important sentence extraction unit that extracts important sentences from the text to be summarized based on the sentence vectors, a field-specific feature vector that characterizes the field to which the text to be summarized belongs, and the text vector.

2. The document summarizing device according to claim 1, wherein the text vector generation unit uses, as the text vector, the average vector of the sentence vectors of the sentences constituting the text to be summarized.

3. The document summarizing device according to claim 1, wherein the important sentence extraction unit calculates a first inner product, which is the inner product of the sentence vector and the text vector, calculates a second inner product, which is the inner product of the sentence vector and the field-specific feature vector, takes the sum of the first inner product and the second inner product as a similarity, and extracts the important sentence by comparing the similarity with a predetermined reference similarity.

4. The document summarizing apparatus according to claim 1, further comprising a summary sentence creating unit that creates a summary sentence based on the extracted important sentence.

5. The document summarizing device according to claim 1, further comprising a field-specific feature vector generation unit that generates the field-specific feature vector for each field based on a plurality of learning texts.

6. The document summarizing device according to claim 5, wherein the field-specific feature vector generation unit generates a text vector corresponding to each of the learning texts and generates, as the field-specific feature vector, the average vector of the text vectors corresponding to the plurality of learning texts.

7. A control method for a document summarizing device, comprising: a sentence vector generation step of generating a sentence vector for each sentence constituting a text to be summarized; a text vector generation step of generating, based on the sentence vectors, a text vector corresponding to the text to be summarized; and an important sentence extraction step of extracting important sentences from the text to be summarized based on the sentence vectors, a field-specific feature vector that characterizes the field to which the text to be summarized belongs, and the text vector.

8. The control method for a document summarizing device according to claim 7, wherein the text vector generation step includes a step of using, as the text vector, the average vector of the sentence vectors of the sentences constituting the text to be summarized.

9. The control method for a document summarizing device according to claim 7, wherein the important sentence extraction step includes: a step of calculating a first inner product, which is the inner product of the sentence vector and the text vector; a step of calculating a second inner product, which is the inner product of the sentence vector and the field-specific feature vector; a step of taking the sum of the first inner product and the second inner product as a similarity; and a step of extracting the important sentence by comparing the similarity with a predetermined reference similarity.

10. The control method for a document summarizing device according to claim 7, further comprising a summary sentence creation step of creating a summary based on the extracted important sentences.

11. A control program for a document summarizing device that causes a computer to function as a document summarizing device generating summary data based on input data of a text to be summarized, the program causing the computer to: generate a sentence vector for each sentence constituting the text to be summarized that corresponds to the input data; generate, based on the sentence vectors, a text vector corresponding to the text to be summarized; and extract important sentences from the text to be summarized based on the sentence vectors, a field-specific feature vector that characterizes the field to which the text to be summarized belongs, and the generated text vector.

12. The control program for a document summarizing device according to claim 11, wherein the average vector of the sentence vectors of the sentences constituting the text to be summarized is calculated and used as the text vector.

13. The control program for a document summarizing device according to claim 11, wherein a first inner product, which is the inner product of the sentence vector and the text vector, is calculated, a second inner product, which is the inner product of the sentence vector and the field-specific feature vector, is calculated, the sum of the first inner product and the second inner product is taken as a similarity, and the important sentence is extracted by comparing the similarity with a predetermined reference similarity.

14. The control program of the document summarizing device according to claim 11, wherein the control program of the document summarizing device is configured to create a summary sentence based on the extracted important sentence.

15. A recording medium on which the control program of the document summarizing device according to claim 11 is recorded.