JP4345129B2

JP4345129B2 - Document processing method and apparatus, and recording medium

Info

Publication number: JP4345129B2
Application number: JP10065399A
Authority: JP
Inventors: 確長尾
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1999-04-07
Filing date: 1999-04-07
Publication date: 2009-10-14
Anticipated expiration: 2019-04-07
Also published as: JP2000293533A

Description

【０００１】
【発明の属する技術分野】
本発明は、電子文書を処理する文書処理方法および装置ならびに電子文書を処理する文書処理プログラムが記録された記録媒体に関する。
【０００２】
【従来の技術】
従来、インターネットにおいて、ウィンドウ形式でハイパーテキスト型情報を提供するアプリケーションサービスとしてＷＷＷ（World Wide Web）が提供されている。
【０００３】
ＷＷＷは、文書の作成、公開または共有化の文書処理を実行し、新しいスタイルの文書の在り方を示したシステムである。しかし、文書の実際上の利用の観点からは、文書の内容に基づいた文書の分類や要約といった、ＷＷＷを越える高度な文書処理が求められている。このような高度な文書処理には、文書の内容の機械的な処理が不可欠である。
【０００４】
しかしながら、文書の内容の機械的な処理は、以下のような理由から依然として困難である。第１に、ハイパーテキストを記述する言語であるＨＴＭＬ（Hyper Text Markup Language）は、文書の表現については規定するが、文書の内容についてはほとんど規定しない。第２に、文書間に構成されたハイパーテキストのネットワークは、文書の読者にとって文書の内容を理解するために必ずしも利用しやすいものではない。第３に、一般に文章の著作者は読者の便宜を念頭に置かずに著作するが、文書の読者の便宜が著作者の便宜と調整されることはない。
【０００５】
このように、ＷＷＷは新しい文書の在り方を示したシステムであるが、文書を機械的に処理しないので、高度な文書処理をおこなうことができなかった。換言すると、高度な文書処理を実行するためには、文書を機械的に処理することが必要となる。
【０００６】
そこで、文書の機械的な処理を目標として、文書の機械的な処理を支援するシステムが自然言語研究の成果に基づいて開発されている。自然言語研究による文書処理として、文書の著作者等による文書の内部構造についての属性情報、いわゆるタグの付与を前提とした、文書に付与されたタグを利用する機械的な文書処理が提案されている。
【０００７】
【発明が解決しようとする課題】
ところで、近年のコンピュータの普及や、ネットワーク化の進展に伴い、文章処理や、文書の内容に依存した索引などで、テキスト文書の作成、ラベル付け、変更などをおこなう文書処理の高機能化が求められている。たとえば、ユーザの要望に応じた文書の要約や、文書の分類等が望まれる。
【０００８】
本発明は、上述の実情に鑑みて提案されるものであって、文書に対するユーザの関心度を算出するような文書処理方法および装置、ならびに文書に対するユーザの関心度を算出するような文書処理プログラムが記録されてなる記録媒体に関する。
【０００９】
【課題を解決するための手段】
上述の課題を解決するために、本発明に係る文書処理方法は、複数の電子文書を処理する文書処理装置の文書処理方法において、受信手段が、複数の電子文書を受信する受信工程と、記録手段が、上記受信工程にて受信された複数の電子文書を記録する記録工程と、表示手段が、上記記録工程にて記録された複数の電子文書のうちユーザによって選択された電子文書及び当該電子文書の要約を表示する表示工程と、入力手段が、上記表示工程にて表示された電子文書に対するユーザの操作情報を入力する入力工程と、実関心度検出手段が、上記表示工程にて表示された電子文書に対して上記入力工程にて入力されたユーザの操作情報に基づいて上記電子文書に対するユーザの実関心度を算出する実関心度検出工程と、優先順位設定手段が、上記実関心度が算出されていない電子文書に対し、上記実関心度検出工程にて実関心度が算出された電子文書のうち上記電子文書の内部構造に基づく関連度が最も高い電子文書の実関心度を予測関心度とし、当該予測関心度に基づいて優先順位を設定する優先順位設定工程と、並べ替え手段が、上記記録工程にて記録された複数の電子文書のうちユーザによって選択されていない電子文書を上記優先順位設定工程にて設定された優先順位に応じて並べ替える並べ替え工程とを有し、上記実関心度検出工程では、上記電子文書及び上記電子文書の要約をそれぞれ表示する各表示領域について、ユーザによって選択された上記電子文書の要素のうち上記電子文書中での出現位置が上記電子文書の先頭から最も遠い位置にある要素の出現位置と上記電子文書のサイズとの比率からなる第１の実関心度要素と、ユーザによって指定されたキーワード数及びユーザによって選択された要素数からなる上記実関心度の第２の実関心度要素と、上記表示工程にて表示される上記電子文書の要約の表示領域のサイズと上記電子文書の表示領域のサイズとの比率からなる第３の上記実関心度要素とを用いて上記実関心度を算出する。
【００１０】
本発明に係る文書処理装置は、複数の電子文書を処理する文書処理装置において、複数の電子文書を受信する受信手段と、上記受信手段によって受信された複数の電子文書を記録する記録手段と、上記記録手段によって記録された複数の電子文書のうちユーザによって選択された電子文書及び当該電子文書の要約を表示する表示手段と、上記表示手段によって表示された電子文書に対するユーザの操作情報を入力する入力手段と、上記表示手段によって表示された電子文書に対して上記入力手段によって入力されたユーザの操作情報に基づいて上記電子文書に対するユーザの実関心度を算出する実関心度検出手段と、上記実関心度が算出されていない電子文書に対し、上記実関心度検出手段によって実関心度が算出された電子文書のうち上記電子文書の内部構造に基づく関連度が最も高い電子文書の実関心度を予測関心度とし、当該予測関心度に基づいて優先順位を設定する優先順位設定手段と、上記記録手段によって記録された複数の電子文書のうちユーザによって選択されていない電子文書を上記優先順位設定手段によって設定された優先順位に応じて並べ替える並べ替え手段とを備え、上記実関心度検出手段は、上記電子文書及び上記電子文書の要約をそれぞれ表示する各表示領域について、ユーザによって選択された上記電子文書の要素のうち上記電子文書中での出現位置が上記電子文書の先頭から最も遠い位置にある要素の出現位置と上記電子文書のサイズとの比率からなる第１の実関心度要素と、ユーザによって指定されたキーワード数及びユーザによって選択された要素数からなる上記実関心度の第２の実関心度要素と、上記表示手段によって表示される上記電子文書の要約の表示領域のサイズと上記電子文書の表示領域のサイズとの比率からなる第３の上記実関心度要素とを用いて上記実関心度を算出する。
【００１１】
本発明に係る記録媒体は、複数の電子文書を処理する文書処理をコンピュータに実行させる文書処理プログラムが記録されたコンピュータが読み取り可能な記録媒体において、上記文書処理プログラムは、受信手段が、複数の電子文書を受信する受信工程と、上記受信工程にて受信された複数の電子文書を記録する記録工程と、表示手段が、上記記録工程にて記録された複数の電子文書のうちユーザによって選択された電子文書及び当該電子文書の要約を表示する表示工程と、入力手段が、上記表示工程にて表示された電子文書に対するユーザの操作情報を入力する入力工程と、実関心度検出手段が、上記表示工程にて表示された電子文書に対して上記入力工程にて入力されたユーザの操作情報に基づいて上記電子文書に対するユーザの実関心度を算出する実関心度検出工程と、優先順位設定手段が、上記実関心度が算出されていない電子文書に対し、上記実関心度検出工程にて実関心度が算出された電子文書のうち上記電子文書の内部構造に基づく関連度が最も高い電子文書の実関心度を予測関心度とし、当該予測関心度に基づいて優先順位を設定する優先順位設定工程と、並べ替え手段が、上記記録工程にて記録された複数の電子文書のうちユーザによって選択されていない電子文書を上記優先順位設定工程にて設定された優先順位に応じて並べ替える並べ替え工程とをコンピュータに実行させ、上記実関心度検出工程では、上記電子文書及び上記電子文書の要約をそれぞれ表示する各表示領域について、ユーザによって選択された上記電子文書の要素のうち上記電子文書中での出現位置が上記電子文書の先頭から最も遠い位置にある要素の出現位置と上記電子文書のサイズとの比率からなる第１の実関心度要素と、ユーザによって指定されたキーワード数及びユーザによって選択された要素数からなる上記実関心度の第２の実関心度要素と、上記表示工程にて表示される上記電子文書の要約の表示領域のサイズと上記電子文書の表示領域のサイズとの比率からなる第３の上記実関心度要素とを用いて上記実関心度を算出するものである。
【００１２】
【発明の実施の形態】
以下、図面を参照して、本発明に係る文書処理方法および装置ならびに記録媒体の実施の形態について説明する。
【００１３】
本発明の実施の形態としての文書処理装置は、図１に示すように、制御部１１およびインターフェース１２を備える本体１０と、ユーザからの入力を受けて本体１０に送る入力部２０と、外部からの信号を受信して本体１０に送る受信部２１と、本体１０からの出力を表示する表示部３０と、記録媒体３２に対して情報を記録／再生する記録／再生部３１とを有している。
【００１４】
本体１０は、制御部１１およびインターフェース１２を有し、この文書処理装置の主要な部分を構成している。制御部１１は、この文書処理装置における処理を実行するＣＰＵ１３と、揮発性のメモリであるＲＡＭ１４と、不揮発性のメモリであるＲＯＭ１５とを有している。ＣＰＵ１３は、たとえばＲＯＭ１５に記録された手順にしたがって、必要な場合にはデータを一時的にＲＡＭ１４に格納して、プログラムを実行するための制御をおこなう。インターフェース１２は、制御部１１、入力部２０、受信部２１、表示部３０および記録／再生部３１に接続される。インターフェース１２は、制御部１１の制御の下に、入力部２０および受信部２１からのデータの入力、表示部３０へのデータの送信、記録／再生部３１に対するデータの送受信について、データを送信するタイミングを調整したり、データの形式を変換したりする。
【００１５】
入力部２０は、この文書処理装置に対するユーザの入力を受ける部分である。この入力部２０は、たとえばキーボードやマウスにより構成される。ユーザは、この入力部２０を用い、キーボードによりキーワードを入力したり、マウスにより表示部３０に表示されている電子文書のエレメントを選択して入力したりすることができる。なお、以下では電子文書を単に文書と称することにする。ここで、エレメントとは文書を構成する要素であって、たとえば文書、文および語が含まれる。
【００１６】
受信部２１は、この文書処理装置に外部からたとえば通信回線を介して送信される信号を受信する部分である。この受信部２１は、外部から送信された複数の文書を受信する。受信部２１は、受信したデータを本体１０に送る。
【００１７】
表示部３０は、この文書処理装置からの文字や画像情報の出力を表示する。表示部３０は、たとえば陰極線管（cathode ray tube;CRT）や液晶表示装置（liquid crystal display;LCD）から構成され、たとえば単数または複数のウィンドウを表示し、このウィンドウ上に文字、図形等を表示したりする。
【００１８】
記録／再生部３１は、たとえばいわゆるフロッピーディスクのような記録媒体３２に対してデータの記録／再生をおこなう。記録媒体３２には、文書を処理する文書処理プログラムが記録されている。この記録媒体３２についてはさらに後述する。
【００１９】
続いて、本実施の形態における文書について説明する。本実施の形態においては、文書処理は、文書に付与された属性情報であるタグを参照しておこなわれる。本実施の形態で用いられるタグには、文書の構造を示す統語論的（syntactic）タグと、多言語間で文書の機械的な内容理解を可能にするような意味的（semantic）・語用論的タグとがある。
【００２０】
統語論的なタグとしては、文書の内部構造を記述するものがある。タグ付けによる内部構造は、図２に示すように、文書、文、語彙エレメント等の各エレメントが、通常リンク、参照・被参照リンクによりリンクされて構成されている。図中において、白丸“○”はエレメントを示し、最下位の白丸は文書における最小レベルの語に対応する語彙エレメントである。また、実線は文書、文、語彙エレメント等のエレメント間のつながり示す通常リンク（normal link ）である。破線は参照・被参照による係り受け関係を示す参照リンク（reference link）である。文書の内部構造は、上位から下位への順序で、文書（document）、サブディビジョン（subdivision ）、段落（paragraph）、文（sentence ）、サブセンテンシャルセグメント（subsentential segment ）、・・・、語彙エレメントから構成される。このうち、サブディビジョンと段落は、オプションである。
【００２１】
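Semantic and pragmatic tagging, on the other hand, describes information such as meaning, for example which sense of a polysemous word is intended. Tagging in this embodiment uses XML (Extensible Markup Language), a markup format similar to HTML (Hyper Text Markup Language).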
【００２２】
タグ付けの一例を次に示すが、文書へのタグ付けはこの方法に限られない。また、以下では英語と日本語の文書の例を示すが、タグ付けによる内部構造の記述は他の言語にも同様に適用することができる。
【００２３】
たとえば、“Time flies like an arrow.”という文については、下記のようなタグ付けをすることができる。
【００２４】
＜文＞＜名詞句語義＝“time０”＞time＜／名詞句＞
＜動詞句＞＜動詞語義＝“fly１”＞flies＜／動詞＞
＜形容動詞句＞＜形容動詞語義＝“like０”＞like＜／形容動詞＞＜名詞句＞an＜名詞語義＝“arrow０”＞arrow＜／名詞＞＜／名詞句＞
＜／形容動詞句＞＜／動詞句＞.＜／文＞
ここで＜文＞、＜名詞＞、＜名詞句＞、＜動詞＞、＜動詞句＞、＜形容動詞＞、＜形容動詞句＞は、それぞれ文、名詞、名詞句、動詞、動詞句、形容詞を含む前置詞句または後置詞句／形容詞句、形容詞句／形容動詞句のような文の統語構造（syntactic structure ）を表している。タグは、エレメントの先端の直前および終端の直後に対応して配置される。エレメントの終端の直後に配置されるタグは、記号“／”によりエレメントの終端であることを示している。エレメントは統語的構成素、すなわち句、節、および文を示す。なお、語義（word sense）＝“time０”は、語“time”の有する複数の意味、すなわち複数の語義のうちの第０番目の意味を指している。具体的には、語“time”には少なくとも名詞、形容詞、動詞の意味があるが、ここでは語“time”が名詞であることを示している。同様に、語“オレンジ”は少なくとも植物の名前、色、果物の意味があるが、これらも語義によって区別することができる。
【００２５】
本実施の形態における文書は、図３に示すように、表示部３０のウィンドウ１０１に統語構造を表示することができる。このウィンドウ１０１においては、右半面１０３に語彙エレメントが、左半面１０２に文の内部構造がそれぞれ表示されている。
【００２６】
このウィンドウ１０１には、タグ付けにより内部構造を記述された次に示すような文書「Ａ氏のＢ会が終わったＣ市で、一部の大衆紙と一般紙がその写真報道を自主規制する方針を紙面で明らかにした。」の一部が表示されている。この文書のタグ付けの例を次に示す。
【００２７】
＜文書＞＜文＞＜形容動詞句関係＝“位置”＞＜名詞句＞＜形容動詞句場所＝“Ｃ市”＞
＜形容動詞句関係＝“主語”＞＜名詞句識別子＝“Ｂ会”＞＜形容動詞句関係＝“所属”＞＜人名識別子＝“Ａ氏”＞Ａ氏＜／人名＞の＜／形容動詞句＞＜組織名識別子＝“Ｂ会”＞Ｂ会＜／組織名＞＜／名詞句＞が＜／形容動詞句＞
終わった＜／形容動詞句＞＜地名識別子＝“Ｃ市”＞Ｃ市＜／地名＞＜／名詞句＞で、＜／形容動詞句＞＜形容動詞句関係＝“主語”＞＜名詞句識別子＝“press” 統語＝“並列”＞＜名詞句＞＜形容動詞句＞一部の＜／形容動詞句＞大衆紙＜／名詞句＞と＜名詞＞一般紙＜／名詞＞＜／名詞句＞が＜／形容動詞句＞
＜形容動詞句関係＝“目的語”＞＜形容動詞句関係＝“内容” 主語＝“press”＞＜形容動詞句関係＝“目的語”＞＜名詞句＞＜形容動詞句＞＜名詞共参照＝“Ｂ会”＞そ＜／名詞＞の＜／形容動詞句＞写真報道＜／名詞句＞を＜／形容動詞句＞
自主規制する＜／形容動詞句＞方針を＜／形容動詞句＞
＜形容動詞句関係＝“位置”＞紙面で＜／形容動詞句＞
明らかにした。＜／文＞＜／文書＞
【００２８】
この文書においては、「一部の大衆紙と一般紙」は、統語＝“並列”というタグにより並列であることが表されている。並列の定義は、係り受け関係を共有すると言うことである。特に何も指定がない場合は、たとえば、＜名詞句関係＝ｘ＞＜名詞＞Ａ＜／名詞＞＜名詞＞Ｂ＜／名詞＞＜／名詞句＞はＡがＢに依存関係のあることを表す。関係＝ｘは関係属性を表す。
【００２９】
関係属性は、統語、意味、修辞についての相互関係を記述する。主語、目的語、間接目的語のような文法機能、動作主、被動作者、受益者などのような主題役割、および理由、結果などのような修辞関係はこの関係属性により記述される。本実施の形態では、主語、目的語、間接目的語のような比較的容易な文法機能について関係属性を記述する。
【００３０】
また、この文書においては、“Ａ氏”、“Ｂ会”、“Ｃ市”のような固有名詞について、地名、人名、組織名等のタグにより属性が記述されている。これら地名、人名、組織名等のタグが付与される語は固有名詞である。
【００３１】
以下では、本発明に係る実施の形態としての文書処理装置の動作について説明する。文書処理装置は、文書に対する実関心度を検出し、検出した実関心度に基づいて他の文書に優先順位を設定するものである。文書処理装置は、文書を表示し、表示された文書に基づいて実関心度を検出する。実関心度は、ユーザの文書に対する操作に応じて検出される。この実関心度との関連度に基づいて、実関心度が与えられていない文書に対して予測関心度が定義される。予測関心度を用いると、ユーザが操作していない文書に対して優先順位を与えることができる。
【００３２】
このような実関心度の説明に先立って、文書の手動分類および文書の自動分類について説明することにする。すなわち、文書処理装置の動作について、（１）文書の手動分類、（２）文書の自動分類、（３）実関心度および予測関心度の順序で説明する。
【００３３】
説明の内容を簡単に述べると、（１）文書の手動分類においては、文書処理装置が外部から送られた文書を受信し、ユーザがこの文書を手動分類する動作について説明する。この手動分類により、文書を分類する分類モデルが作成される。（２）文書の自動分類においては、文書の手動分類により作成された分類モデルに基づいて、文書分類間関連度を用いて文書を分類する動作について説明する。（３）実関心度および予測関心度においては、ユーザの操作に基づいて検出される実関心度と、この実関心度および文書間関連度に基づいて得られる予測関心度に基づいておこなわれる処理について説明する。
【００３４】
（１）文書の手動分類
本実施の形態では、初期状態では分類モデルが存在しない。初期状態においては、分類モデルを作成するために、外部から送られた文書を手動によって分類する必要がある。このような文書処理装置の手動分類の動作について、図４を参照して説明する。
【００３５】
図４のステップＳ１１では、文書処理装置の受信部２１は、たとえば通信回線を介して送信された複数の文書を受信する。受信部２１は、受信した文書を文書処理装置の本体１０に送る。
【００３６】
ステップＳ１２では、文書処理装置の制御部１１は、受信部２１から送られた複数文書の特徴を抽出し、それぞれの文書の特徴情報すなわちインデックスを作成する。制御部１１は、受信した複数の文書や、作成したインデックスを、たとえばＲＡＭ１４に記憶させる。インデックスは、その文書に特徴的な、固有名詞、固有名詞以外の語義などを含む。
【００３７】
ここで、インデックスの具体例を示す。
【００３８】
＜インデックス日付＝“AAAA/BB/CC” 時刻＝“DD:EE:FF” 文書アドレス＝“1234”＞
＜ユーザの操作履歴最大要約サイズ＝“100”＞
＜選択エレメントの数＝“10”＞ピクチャーテル＜／選択＞
・・・
＜／ユーザの操作履歴＞
＜要約＞減税規模、触れず−Ｘ首相の会見＜／要約＞
＜語語義＝“0003” 中心活性値＝“140.6”＞触れず＜／語＞
＜語語義＝“0105” 識別子＝“Ｘ” 中心活性値＝“67.2”＞首相＜／語＞
＜人名識別子＝“Ｘ” 語語義＝“6103” 中心活性値＝“150.2”＞Ｘ首相＜／語／人名＞
＜語語義＝“5301” 中心活性値＝“120.6”＞求めた＜／語＞
＜語語義＝“2350” 識別子＝“Ｘ” 中心活性値＝“31.4”＞首相＜／語＞
＜語語義＝“9582” 中心活性値＝“182.3”＞強調した＜／語＞
＜語語義＝“2595” 中心活性値＝“93.6”＞触れる＜／語＞
＜語語義＝“9472” 中心活性値＝“12.0”＞予告した＜／語＞
＜語語義＝“4934” 中心活性値＝“46.7”＞触れなかった＜／語＞
＜語語義＝“0178” 中心活性値＝“175.7”＞釈明した＜／語＞
＜語語義＝“7248” 識別子＝“Ｘ” 中心活性値＝“130.6”＞私＜／語＞
＜語語義＝“3684” 識別子＝“Ｘ” 中心活性値＝“121.9”＞首相＜／語＞
＜語語義＝“1824” 中心活性値＝“144.4”＞訴えた＜／語＞
＜語語義＝“7289” 中心活性値＝“176.8”＞見せた＜／語＞
＜／インデックス＞
【００３９】
このインデックスにおいては、＜インデックス＞および＜／インデックス＞は、インデックスの始端および終端を、＜日付＞および＜時刻＞はこのインデックスが作成された日付および時刻を、＜要約＞および＜／要約＞はこのインデックスの内容の要約の始端および終端を示している。＜語＞および＜／語＞は語の始端および終端を、それぞれ示している。語義＝“0003”は、第３番目の語義であることを示している。他についても同様である。すなわち、同じ語でも複数の意味を持つ場合があるので、それを区別するために語義ごとに番号が予め決められている。したがって、同じ語に対して単数または複数の語義が存在する。
【００４０】
また、＜ユーザの操作履歴＞および＜／ユーザの操作履歴＞は、ユーザの操作履歴の始端および終端を、＜選択＞および＜／選択＞は、選択されたエレメントの始端および終端を、それぞれ示している。最大要約サイズ＝“100”は、要約の最大のサイズが１００文字であることを、エレメントの数＝“10”は、選択されたエレメントの数が１０であることを示している。
【００４１】
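In step S13 of FIG. 4, the user browses the documents displayed on the display unit 30 of the document processing apparatus, as in the display example of FIG. 5. In FIG. 5, documents not yet classified by the user are classified under "Other Topics", and their icons and titles are shown under "Other Topics" in the first display section 303 of window 301. The control unit 11 controls the display unit 30 so that, from among the documents listed in this way, the document the user wants is displayed. The control unit 11 selects the document to display according to the user's input to the input unit 20. The selected document is shown on the display unit 30 in a window whose area can be resized. When the whole document does not fit in this window, part of the document is displayed.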
【００４２】
なお、ユーザが文書閲覧をおこなうこのステップＳ１３は、ユーザの必要に応じて設けられる。また、図中においてこのステップＳ１３が平行四辺形で表されているのは、ユーザが操作することを示すものである。以下も同様である。
【００４３】
ここで、上述の図５で示した表示の具体例について詳細に説明する。この具体例においては、ユーザが自由に文書を分類するカテゴリを設定や変更をすることができるようにしている。このようなカテゴリの設定や変更は、ユーザが手動によりおこなう。
【００４４】
表示部３０において文書分類の表示に用いられるグラフィックユーザインターフェース（graphic user interface; GUI）の具体例は、図６に示すようになる。この文書分類ウィンドウ３０１は、画面のウィンドウの状態を初期の位置にもどすポジションリセット（position reset）のボタンと、文書の内容を閲読するブラウザ（browser ）を呼び出すブラウザのボタンと、このウィンドウからの脱出（exit）のボタンとを含む操作ボタン３０２を有している。
【００４５】
また、文書分類ウィンドウ３０１は、上述した“他のトピックス”を表示する第１の分類表示部３０３、“ビジネスニュース”を表示する第２の分類表示部３０４、“政治ニュース”を表示する第３の分類表示部３０５等が表示されている。これらの分類部には、各カテゴリに対応し、そのカテゴリに分類された文書のアイコンと文書のタイトルが表示されている。タイトルがない場合には、一文の要約が表示される。各分類表示部の大きさは固定的ではなく、たとえば入力部２０のマウスにて操作することにより、所望の大きさに変更することができる。また、分類表示部のタイトルまたはラベルも自由に変更することができる。
【００４６】
第１の分類表示部３０３の“他のトピックス”には、たとえば第２の分類表示部３０４以下に対応するカテゴリに分類される前の文書のタイトルが表示される。すなわち、この手動分類の工程では、文書処理装置が受信した文書は、一旦は第１の分類表示部３０３の“他のトピックス”に表示される。第１の分類表示部３０３に表示された文書は、以下のようにユーザによりカテゴリに分類される。
【００４７】
図４のステップＳ１４においては、ユーザは、ステップＳ１３において文書処理装置の表示部３０にて閲覧した複数の文書を分類するための複数のカテゴリからなる分類モデルを作成する。そして、分類モデルの各カテゴリに上記複数の文書を分類する。
【００４８】
分類モデルは、文書を分類する複数の分類項目すなわちカテゴリから構成される。カテゴリは、そのカテゴリに特徴的な、固有名詞、固有名詞以外の語義やカテゴリに含まれる文書アドレス等を含んでなるカテゴリインデックスから構成される。カテゴリインデックスは、固有名詞、固有名詞以外の語義を含む文書のインデックスから構成される。
【００４９】
たとえば、図７に示す分類モデルは、各カテゴリに対応するカテゴリインデックスについて、固有名詞、固有名詞以外の語義、文書アドレスの欄を有している。この分類モデルにおいては、カテゴリ“スポーツ”、“社会”、“コンピュータ”、“植物”、“美術”および“イベント”に対して、固有名詞“Ａ氏、・・・”、“Ｂ氏、・・・”、“Ｃ社、Ｇ社、・・・”、“Ｄ種、・・・”、“Ｅ氏、・・・”および“Ｆ氏”を、語義“野球（４５４６）、グランド（２３４３）、・・・”、“労働（３１１２）、固有（９８２１）、・・・”、“モバイル（２１０２）、・・・”、“桜１(１１１１１)、オレンジ１（９９１１）”、“桜２(１１１１２)、オレンジ２（９９１２）”および“桜３(１１１１３)”を、この分類モデルに対応する文書アドレス“ＳＰ１、ＳＰ２、ＳＰ３、・・・”、“ＳＯ１、ＳＯ２、ＳＯ３、・・・”、“ＣＯ１、ＣＯ２、ＣＯ３、・・・”、“ＰＬ１、ＰＬ２、ＰＬ３、・・・”、“ＡＲ１、ＡＲ２、ＡＲ３、・・・”および“ＥＶ１、ＥＶ２、ＥＶ３、・・・”をそれぞれ有している。なお、“桜１”、“桜２”および“桜３”は“桜”の第１の語義(１１１１１)、第２の語義(１１１１２)および第３の語義(１１１１３)を示している。また、“オレンジ１”および“オレンジ２”は、“オレンジ”の第１の語義（９９１１）および第２の語義（９９１２）を示している。たとえば“オレンジ１”は植物のオレンジを表し、“オレンジ２”はオレンジ色を表す。
【００５０】
分類モデルが更新されると、分類モデルに更新日時が記録される。図中には、更新日時として“１９９８年１２月１０日１９時５６分１０秒”が記録されている。
【００５１】
分類モデルのカテゴリの作成は、文書分類ウィンドウ３０１において、各カテゴリに対応する分類表示部を変更や削除したり、新たに分類表示部を設定することにより、ユーザが手動でおこなう。
【００５２】
文書のカテゴリへの分類操作は、たとえば、文書分類ウィンドウ３０１において、分類表示部に表示された文書のタイトルに対応するアイコンを、入力部２０のマウスを用い、所望のカテゴリに対応する分類表示部にドラッグすることによりおこなう。カテゴリに分類された文書のタイトルは、文書分類ウィンドウ３０１において、各カテゴリに対応する分類表示部に表示される。
【００５３】
ステップＳ１５においては、文書処理装置の制御部１１は、ステップＳ１４においておこなわれたカテゴリの作成と、このカテゴリに応じたユーザの手動による分類操作によって分類された各文書のインデックスに基づいて、分類モデルを作成する。すなわち、文書処理装置の制御部１１は、各カテゴリに分類された上記複数の文書のインデックスを集めて、分類モデルを生成する。
【００５４】
各カテゴリのカテゴリインデックスは、そのカテゴリに特徴的な固有名詞、固有名詞以外の語義、各カテゴリに分類された文書アドレスからなる。ここで、固有名詞以外の場合に語そのものではなく語義を用いるのは、同じ語でも複数の意味を有することがあるからである。そして、文書処理装置の制御部１１は、このように作成した分類モデルをたとえばＲＡＭ１４に記憶させる。
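The way step S15 gathers per-document indexes into a classification model can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class names `DocIndex` and `CategoryIndex`, the function `build_model`, and the choice of Python sets are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class DocIndex:
    """Hypothetical per-document index: address, proper nouns, word-sense ids."""
    address: str
    proper_nouns: Set[str] = field(default_factory=set)
    word_senses: Set[str] = field(default_factory=set)


@dataclass
class CategoryIndex:
    """Hypothetical category index holding the three columns of FIG. 7."""
    proper_nouns: Set[str] = field(default_factory=set)
    word_senses: Set[str] = field(default_factory=set)
    addresses: Set[str] = field(default_factory=set)


def build_model(assignments: Iterable[Tuple[str, DocIndex]]) -> Dict[str, CategoryIndex]:
    """Merge the index of every manually classified document into its category."""
    model: Dict[str, CategoryIndex] = {}
    for category, doc in assignments:
        entry = model.setdefault(category, CategoryIndex())
        entry.proper_nouns |= doc.proper_nouns   # proper nouns are stored as words
        entry.word_senses |= doc.word_senses     # other words are stored as sense ids
        entry.addresses.add(doc.address)
    return model
```

Storing sense identifiers rather than surface words for everything except proper nouns mirrors the reason given in the text: the same word can carry several meanings.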
【００５５】
なお、ステップＳ１５における分類モデルの作成は、ステップＳ１４におけるカテゴリの作成と、ユーザの手動による分類操作がおこなわれる度におこなうこともできる。
【００５６】
ステップＳ１６では、文書処理装置の制御部１１は、ステップＳ１５で作成された分類モデルを登録する。制御部１１は、登録した分類モデルをたとえばＲＡＭ１４に記憶させる。
【００５７】
（２）文書の自動分類
次に、文書処理装置が分類モデルに基づいておこなう文書の自動分類について、図８を参照して説明する。この文書分類は、図４に示す処理により分類モデルが作成された後に受信した文書に対しておこなわれる。なお、この例では、一つの文書を受信する毎に図８に示す処理をおこなうこととして説明するが、複数の所定数の文書を受信する度におこなってもよいし、ユーザが図６の画面を開く操作をしたときにそれまでに受信した全文書に対して処理をおこなってもよい。
【００５８】
ステップＳ２１では、文書処理装置の受信部２１は、外部から文書を受信する。この文書の受信については、ステップＳ１１で説明したので、ここでの説明を省略することにする。
【００５９】
ステップＳ２２に進み、文書処理装置の制御部１１は、ステップＳ２１でＲＡＭ１４に記憶された文書を読み出し、インデックスを作成する。このインデックスの作成については、さらに後述する。
【００６０】
ステップＳ２３では、文書処理装置の制御部１１は、分類モデルに基づいて、インデックスを附された各文書を分類モデルのいずれかのカテゴリに自動分類する。そして、制御部１１は、分類の結果をたとえばＲＡＭ１４に記憶させる。自動分類の詳細については、さらに後述する。
【００６１】
ステップＳ２４では、文書処理装置の制御部１１は、たとえばＲＡＭ１４に記憶されたステップＳ２３での新たな文書の自動分類の結果に基づいて、分類モデルを更新する。ステップＳ２５では、文書処理装置の制御部１１は、ステップＳ２４で更新された分類モデルを登録する。制御部１１は、登録した分類モデルをたとえばＲＡＭ１４に記憶させる。
【００６２】
次に、図４のステップＳ１２および図８のステップＳ２２でのインデックス作成について、図９を参照して説明する。
【００６３】
ステップＳ３１においては、文書処理装置の制御部１１は、図４のステップＳ１１および図８のステップＳ２１で受信された文書について、エレメントの中心活性値を文書の内部構造に基づいて拡散する活性拡散を実行する。中心活性値の拡散処理については、さらに後述する。制御部１１は、活性拡散の結果として得られた各エレメントの中心活性値を、たとえばＲＡＭ１４に記憶させる。
【００６４】
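In step S32, the control unit 11 of the document processing apparatus extracts, based on the center activation value of each element obtained in step S31, the elements whose center activation value exceeds a preset threshold. The control unit 11 stores the extracted elements in, for example, the RAM 14.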
【００６５】
ステップＳ３３においては、文書処理装置の制御部１１は、ステップＳ３２にて抽出したエレメントをたとえばＲＡＭ１４から読み出す。そして、制御部１１は、このエレメントからすべての固有名詞を取り出してインデックスに加える。固有名詞は語義を持たず、辞書に載っていないなどの特殊の性質を有するので固有名詞以外の語とは別に扱うものである。ここで、語義とは、語の有する複数の意味のうちの各意味に対応したものである。
【００６６】
文書処理装置の制御部１１は、エレメントが固有名詞であるか否か、受信した文書に附されたタグに基づいて判断する。たとえば、図３に示したタグ付けによる内部構造においては、“Ａ氏”、“Ｂ会”および“Ｃ市”は、タグによる関係属性がそれぞれ“人名”、“組織名”および“地名”であるので固有名詞であることが分かる。そして、制御部１１は、取り出した固有名詞をインデックスに加え、その結果をたとえばＲＡＭ１４に記憶させる。
【００６７】
ステップＳ３４においては、文書処理装置の制御部１１は、たとえばＲＡＭ１４から、ステップＳ３２にて抽出したエレメントから、固有名詞以外の語義を取り出してインデックスに加え、その結果をＲＡＭ１４に記憶させる。
【００６８】
このように、文書の特徴を発見してインデックスを作成する手順は、タグ付けされた文書の特徴を発見して、その特徴を配列したインデックスを作るものである。文書の特徴は、文書の内部構造に応じて拡散処理された中心活性値に基づいて判断される。
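Steps S32 to S34 can be sketched as a simple filter over elements that already carry center activation values. The record layout (`Element`, `Index`) and the function name `build_index` are hypothetical, introduced only for this illustration; the patent derives the proper-noun flag from tags such as person, organization and place names.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class Element:
    """Hypothetical element record after spreading activation."""
    text: str
    activation: float                 # center activation value
    word_sense: Optional[str] = None  # sense id such as "0003"; None for proper nouns
    is_proper_noun: bool = False      # assumed to be derived from the element's tag


@dataclass
class Index:
    doc_address: str
    proper_nouns: Set[str] = field(default_factory=set)
    word_senses: Set[str] = field(default_factory=set)


def build_index(doc_address: str, elements: Iterable[Element], threshold: float) -> Index:
    """Keep elements above the threshold (S32); file proper nouns (S33)
    and the senses of all other words (S34) separately."""
    index = Index(doc_address)
    for el in elements:
        if el.activation <= threshold:
            continue
        if el.is_proper_noun:
            index.proper_nouns.add(el.text)
        elif el.word_sense is not None:
            index.word_senses.add(el.word_sense)
    return index
```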
【００６９】
なお、上述のインデックスには、文書の特徴を表す語義および固有名詞とともに、その文書がＲＡＭ１４において記憶された位置を示す文書アドレスを含めておく。
【００７０】
インデックスは文書を代表するような特徴を表す語義および固有名詞を含むので、所望の文書を参照する際に用いることができる。
【００７１】
次に、文書の内部構造に基づいて、エレメントに対応する中心活性値を拡散する活性拡散について、図１０を参照して説明する。活性拡散は、図９のステップＳ３１他でおこなわれる。活性拡散は、中心活性値の高いエレメントと関わりのあるエレメントにも高い中心活性値を与えるような処理である。この中心活性値は、タグ付けによる内部構造に応じて決定されるので、文書の特徴の抽出等に利用される。
【００７２】
ステップＳ８１では、文書処理装置の制御部１１は、参照・被参照リンクと通常リンクに関しては、エレメントを連結するリンクの端点の端点活性値を０に設定する。制御部１１は、このように付与した端点活性値の初期値を、たとえばＲＡＭ１４に記憶させる。
【００７３】
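Elements are connected to one another as shown, for example, in FIG. 11. The figure shows elements E_i and E_j as part of the element-and-link structure of a document. E_i and E_j have center activation values e_i and e_j respectively and are connected by link L_ij. The endpoint of L_ij on the E_i side is T_ij, and the endpoint on the E_j side is T_ji. Besides E_j, element E_i is connected through links L_ik, L_il and L_im to elements E_k, E_l and E_m (not shown). Besides E_i, to which it is connected by L_ji (the same link L_ij seen from E_j), element E_j is connected through links L_jp, L_jq and L_jr to elements E_p, E_q and E_r (not shown).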
【００７４】
ステップＳ８２においては、文書処理装置の制御部１１は、文書を構成するエレメントＥ_iを計数するカウンタの初期化をおこなう。すなわち、エレメントを計数するカウンタのカウント値ｉを１に設定する。このカウンタは、第１番目のエレメントＥ₁を参照することになる。
【００７５】
ステップＳ８３においては、文書処理装置の制御部１１は、カウンタが参照するエレメントについて、新たな中心活性値を計算するリンク処理を実行する。このリンク処理については、さらに後述する。
【００７６】
ステップＳ８４においては、文書処理装置の制御部１１は、文書中のすべてのエレメントについて新たな中心活性値の計算が完了したか否かを判断する。そして、制御部１１は、文書中のすべてのエレメントについて中心活性値の計算が完了したときには“ＹＥＳ”としてステップＳ８５に処理を進め、文書中のすべてのエレメントについて新たな中心活性値の計算が完了していないときには“ＮＯ”としてステップＳ８７に処理を進める。
【００７７】
具体的には、制御部１１は、カウンタのカウント値ｉが、文書の含むエレメントの総数に達したか否かを判断する。そして、制御部１１は、カウンタのカウント値ｉが文書に含まれるエレメントの総数に達したときには、すべてのエレメントが計算済みとしてステップＳ８５に処理を進める。制御部１１は、カウンタのカウント値ｉが文書に含まれるエレメントの総数に達していないときにはすべてのエレメントについて計算が終了していないとしてステップＳ８７に処理を進める。
【００７８】
ステップＳ８７においては、文書処理装置の制御部１１は、カウンタのカウント値ｉを１増加させて、カウンタのカウント値をｉ＋１とする。このことにより、カウンタはｉ＋１番目Ｅ_i+1のエレメント、すなわち次のエレメントを参照する。そして、処理はステップＳ８３にもどり、端点活性値の計算およびこれに続く一連の行程が、次のｉ＋１番目のエレメントＥ_i+1について実行される。
【００７９】
ステップＳ８５においては、文書処理装置の制御部１１は、文書に含まれるすべてのエレメントの中心活性値の変化分、すなわち新たに計算された中心活性値の元の中心活性値に対する変化分について平均値を計算する。
【００８０】
文書処理装置の制御部１１は、たとえばＲＡＭ１４に記憶された元の中心活性値と新たに計算した中心活性値を、文書に含まれるすべてのエレメントについて読み出す。制御部１１は、新たに計算した中心活性値の元の中心活性値に対するそれぞれの変化分の総和を文書に含まれるエレメントの総数で除することにより、すべてのエレメントの中心活性値の変化分の平均値を計算する。制御部１１は、このように計算したすべてのエレメントの中心活性値の変化分の平均値を、たとえばＲＡＭ１４に記憶させる。
【００８１】
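In step S86, the control unit 11 determines whether the average change in the center activation values of all elements, computed in step S85, is within a preset threshold. If it is, the result is "YES" and this series of steps ends. If it is not, the result is "NO"; the count value i is set to 1 again in step S82, and the series of steps that computes the center activation values of the document's elements is executed once more. Each time the loop from step S82 to step S84 that makes up this series of steps is repeated, the change gradually decreases.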
【００８２】
続いて、図１０のステップＳ８３にて実行されるリンク処理について、図１２を参照して説明する。ここでは、一のエレメントＥ_iに対する処理を例にとるが、中心活性値の拡散処理の際には、リンク処理はすべてのエレメントに対しておこなわれる。
【００８３】
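In step S51, the control unit 11 of the document processing apparatus initializes a counter that counts the links having one end connected to element E_i of the document. That is, it sets the link counter's value j to 1, so that the counter refers to the first link L_i1 connected to element E_i.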
【００８４】
ステップＳ５２では、文書処理装置の制御部１１は、エレメントＥ_iとＥ_jを接続するリンクＬ_ijについて、関係属性のタグを参照することにより通常リンクであるか否かを判断する。制御部１１は、リンクＬ_ijが通常リンクのときには“ＹＥＳ”としてステップＳ５３に処理を進め、リンクＬ_ijが参照リンクのときには“ＮＯ”としてステップＳ５４に処理を進める。
【００８５】
ステップＳ５３においては、文書処理装置の制御部１１は、エレメントＥ_iの通常リンクＬ_ijに接続された端点Ｔ_ijの新たな端点活性値を計算する処理をおこなう。
【００８６】
ここでは、ステップＳ５２における判別により、リンクＬ_ijは通常リンクであることが明らかになっている。エレメントＥ_iの通常リンクＬ_ijに接続される端点Ｔ_ijの端点活性値ｔ_ijは、エレメントＥ_jの端点活性値のうち、リンクＬ_ij以外のリンクに接続するすべての端点Ｔ_jp、Ｔ_jq、Ｔ_jrの端点活性値ｔ_jp、ｔ_jq、ｔ_jrと、エレメントＥ_iがリンクＬ_ijにより接続されるエレメントＥ_jの中心活性値ｅ_jを加算し、この加算で得た値を文書に含まれるエレメントの総数で除することにより求められる。
【００８７】
文書処理装置の制御部１１は、たとえばＲＡＭ１４から、端点活性値および中心活性値を読み出す。制御部１１は、読み出された端点活性値および中心活性値について、上述のようにその通常リンクと接続された端点の新たな端点活性値を計算する。そして制御部１１は、このように計算した端点活性値を、たとえばＲＡＭ１４に記憶させる。
【００８８】
ステップＳ５４では、文書処理装置の制御部１１は、エレメントＥ_iの参照リンクに接続された端点Ｔ_ijの端点活性値を計算する処理をおこなう。
【００８９】
ステップＳ５２における判別により、リンクＬ_ijは参照リンクであることが明らかになっている。エレメントＥ_iの参照リンクＬ_ijに接続する端点Ｔ_ijの新たな端点活性値ｔ_ijは、エレメントＥ_jの端点活性値のうち、このリンクＬ_ijを除いたリンクに接続するすべての端点Ｔ_jp、Ｔ_jq、Ｔ_jrの端点活性値ｔ_jp、ｔ_jq、ｔ_jrと、エレメントＥ_iがリンクＬ_ijにより接続されるエレメントＥ_jの中心活性値ｅ_jを加算することにより求められる。
【００９０】
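The control unit 11 of the document processing apparatus reads the necessary endpoint activation values and center activation values from those stored in, for example, the RAM 14. Using the values read, the control unit 11 computes the new endpoint activation value of the endpoint connected to the reference link as described above, and stores the computed endpoint activation value in, for example, the RAM 14.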
【００９１】
ステップＳ５３における通常リンクの処理、およびステップＳ５４における参照リンクの処理は、ステップＳ５２からステップＳ５５に至るループにあるように、カウント値ｉにより参照されているエレメントＥ_iに接続するすべてのリンクＬ_ijに対して実行される。
【００９２】
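In step S55, the control unit 11 of the document processing apparatus determines whether endpoint activation values have been computed for all links connected to element E_i. If they have, the result is "YES" and processing proceeds to step S56; if they have not, the result is "NO" and processing proceeds to step S57.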
【００９３】
ステップＳ５６においては、ステップＳ５５にてエレメントＥ_iのすべてのリンクＬ_ijについて端点活性値ｔ_ijが求められたことが判別されたので、文書処理装置の制御部１１は、エレメントＥ_iの中心活性値ｅ_iの更新を実行する。
【００９４】
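The new (updated) value of the center activation value e_i of element E_i is obtained by adding the new endpoint activation values of all endpoints of E_i to its current center activation value: $e_i' = e_i + \sum_j t_{ij}'$. Here the prime (′) denotes a new value.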
【００９５】
文書処理装置の制御部１１は、たとえばＲＡＭ１４に記憶された端点活性値および中心活性値から必要な端点活性値を読み出す。制御部１１は、上述したような計算を実行し、そのエレメントＥ_iの中心活性値ｅ_iを算出する。そして、制御部１１は、計算した新たな中心活性値ｅ_iをたとえばＲＡＭ１４に記憶させる。
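The spreading activation of FIG. 10 and the link processing of FIG. 12 can be sketched together as below. Several points are assumptions of this sketch rather than statements of the patent: the function name `spread_activation`, the representation of links as `(i, j, kind)` triples with at most one link per pair, the choice to compute every new value in a pass from the values at the start of that pass, and the `max_passes` cap, which is added as a safeguard because the update rule as literally described is not guaranteed to settle.

```python
from typing import Dict, List, Sequence, Tuple

Link = Tuple[int, int, str]  # (element i, element j, "normal" or "reference")


def spread_activation(e: Sequence[float], links: Sequence[Link],
                      tol: float = 1e-3, max_passes: int = 10) -> List[float]:
    """Return center activation values after spreading along the links."""
    n = len(e)
    e = list(e)
    neighbours: Dict[int, List[Tuple[int, str]]] = {i: [] for i in range(n)}
    for i, j, kind in links:
        neighbours[i].append((j, kind))
        neighbours[j].append((i, kind))
    # S81: every endpoint activation value t_ij starts at 0.
    t = {(i, j): 0.0 for i in neighbours for j, _ in neighbours[i]}

    for _ in range(max_passes):
        new_t: Dict[Tuple[int, int], float] = {}
        new_e: List[float] = []
        for i in range(n):                          # S82-S84, S87: visit each element
            total = 0.0
            for j, kind in neighbours[i]:           # S51-S55: visit each link of E_i
                # e_j plus the endpoint values of E_j on all links other than L_ij
                incoming = e[j] + sum(t[(j, p)] for p, _ in neighbours[j] if p != i)
                if kind == "normal":                # S53: divide by the element count
                    incoming /= n
                new_t[(i, j)] = incoming            # S54: reference links keep the sum
                total += incoming
            new_e.append(e[i] + total)              # S56: e_i' = e_i + sum of t_ij'
        mean_change = sum(abs(a - b) for a, b in zip(new_e, e)) / n   # S85
        e, t = new_e, new_t
        if mean_change <= tol:                      # S86
            break
    return e
```

With a three-element chain `0-1-2` of normal links and initial values `[3.0, 0.0, 0.0]`, one pass moves a third of element 0's activation to element 1, which is the damping that distinguishes normal links from reference links.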
【００９６】
次に、図８のステップＳ２３での自動分類について、図１３を参照して説明する。
【００９７】
ステップＳ７１では、文書処理装置の制御部１１は、分類モデルのカテゴリＣ_iに含まれる固有名詞の集合と、ステップＳ２１で受信した文書から抽出されインデックスに入れられた語のうちの固有名詞の集合とについて、これらの共通集合の数をＰ（Ｃ_i ）とする。そして、制御部１１は、このようにして算出した数Ｐ（Ｃ_i ）をたとえばＲＡＭ１４に記憶させる。
【００９８】
ステップＳ７２においては、文書処理装置の制御部１１は、その文書のインデックス中に含まれる全語義と各カテゴリＣ_iに含まれる全語義との語義間関連度を、後述する図１５に示す語義間関連度の表を参照し、語義間関連度の総和Ｒ（Ｃ_i ）を演算する。すなわち、制御部１１は、分類モデルにおける固有名詞以外の語について、全語義間関連度の総和Ｒ（Ｃ_i ）を演算する。そして、制御部１１は、演算した語義間関連度の総和Ｒ（Ｃ_i ）をたとえばＲＡＭ１４に記憶させる。
【００９９】
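In step S73, the control unit 11 of the document processing apparatus defines the document-category relatedness of the document with respect to category C_i as
$Rel(C_i) = m_1 P(C_i) + n_1 R(C_i)$
where the coefficients m_1 and n_1 are constants that express how much each term contributes to the document-category relatedness. The control unit 11 reads from, for example, the RAM 14 the size P(C_i) of the common set computed in step S71 and the sum R(C_i) of inter-sense relatedness computed in step S72, and substitutes them into the formula above to obtain Rel(C_i). The coefficients can be set to, for example, m_1 = 10 and n_1 = 1. The control unit 11 then stores the computed document-category relatedness Rel(C_i) in, for example, the RAM 14.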
【０１００】
係数ｍ₁およびｎ₁の値は、統計的手法を使って推定することもできる。すなわち、制御部１１は、複数の係数ｍおよびｎの対について文書分類間関連度Ｒｅｌ（Ｃ_i ）が与えられると、上記係数を最適化により求めることができる。
【０１０１】
ステップＳ７４においては、文書処理装置の制御部１１は、カテゴリＣ_iに対する文書分類間関連度Ｒｅｌ（Ｃ_i ）が最大で、その文書分類間関連度Ｒｅｌ（Ｃ_i ）の値がある閾値を越えているとき、そのカテゴリＣ_iに文書を分類する。すなわち、制御部１１は、複数のカテゴリに対してそれぞれ文書分類間関連度を作成し、最大の文書分類間関連度が閾値を越えているときには、文書を最大の文書分類間関連度を有する上記カテゴリＣ_iに分類する。最大の文書分類間関連度が閾値を越えていないときには、文書の分類はおこなわない。
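Steps S71 to S74 amount to scoring each category and keeping the best one only if it clears a threshold. The sketch below assumes that indexes are Python sets and that inter-sense relatedness is supplied by a caller-provided function `rel(a, b)`; the names `relevance` and `classify` are hypothetical.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def relevance(doc_nouns: Set[str], doc_senses: Set[str],
              cat_nouns: Set[str], cat_senses: Set[str],
              rel: Callable[[str, str], float],
              m1: float = 10.0, n1: float = 1.0) -> float:
    """Rel(C_i) = m1 * P(C_i) + n1 * R(C_i)."""
    p = len(doc_nouns & cat_nouns)                               # S71: shared proper nouns
    r = sum(rel(a, b) for a in doc_senses for b in cat_senses)   # S72: summed relatedness
    return m1 * p + n1 * r                                       # S73


def classify(doc_nouns: Set[str], doc_senses: Set[str],
             categories: Dict[str, Tuple[Set[str], Set[str]]],
             rel: Callable[[str, str], float],
             threshold: float) -> Optional[str]:
    """S74: return the best category, or None when no score exceeds the threshold."""
    scores = {name: relevance(doc_nouns, doc_senses, nouns, senses, rel)
              for name, (nouns, senses) in categories.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

Returning `None` corresponds to the case in the text where the document is left unclassified.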
【０１０２】
次に、図１３のステップＳ７２で用いられる語義間関連度の演算について、図１４を参照して説明する。この図１４に示す処理は、図４に示す処理を行う前に一度だけおこなえばよい。
【０１０３】
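In step S61, the control unit 11 of the document processing apparatus builds a network of word senses from the sense definitions in an electronic dictionary. That is, the network is built from the reference relations between each sense's definition in the dictionary and the senses that appear in that definition. This yields a tree-shaped network of word senses with the dictionary as its root. The internal structure of the network is described by tagging as explained above. The control unit 11 reads the senses and their definitions in order from the electronic dictionary stored in, for example, the RAM 14 and builds the network. The control unit 11 stores the resulting network of word senses in, for example, the RAM 14.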
【０１０４】
なお、上記ネットワークは、文書処理装置の制御部１１が辞書を用いて作成する他に、受信部２１にて外部から受信したり、記録／再生部３１にて記録媒体３２から再生したりすることにより得ることもできる。上記辞書は、受信部２１にて外部から受信したり、記録／再生部３１にて記録媒体３２から再生したりすることにより得られる。
【０１０５】
ステップＳ６２において、ステップＳ６１で作成された語義のネットワーク上で、各語義のエレメントに対応する中心活性値の拡散処理をおこなう。この活性拡散により、各語義に対応する中心活性値は、上記辞書により与えられたタグ付けによる内部構造に応じて与えられる。中心活性値の拡散処理については、さらに後述する。
【０１０６】
ステップＳ６３においては、ステップＳ６１で作成された語義のネットワークを構成する一の語義ｓ_iを選択し、ステップＳ６４においては、この一の語義ｓ_iに対応する語彙エレメントＥ_iの中心活性値ｅ_iの初期値を変化させ、このときの中心活性値の差分Δｅ_iを計算する。
【０１０７】
ステップＳ６５においては、ステップＳ６４におけるエレメントＥ_iの中心活性値ｅ_iの差分Δｅ_iに対応する、他の語義ｓ_jに対応するエレメントＥ_jの中心活性値ｅ_jの差分Δｅ_jを求める。ステップＳ６６においては、ステップＳ６５で求めた差分Δｅ_jをステップＳ６４で求めたΔｅ_iで除した商Δｅ_j／Δｅ_iを、語義ｓ_iの語義ｓ_jに対する語義間関連度とする。
【０１０８】
ステップＳ６７においては、一の語義ｓ_iと他の語義ｓ_jとのすべての対について語義間関連度の演算が終了したか否かについて判断する。そして、すべての語義の対について語義間関連度の演算が終了したときには“ＹＥＳ”として、この一連の処理を終了する。すべての語義の対について語義間関連度の演算が終了していないときには、“ＮＯ”として、ステップＳ６３にもどり、語義間関連度の演算が終了していない対について語義間関連度の演算を継続する。
【０１０９】
ステップＳ６３からステップＳ６７のループにおいて、文書処理装置の制御部１１は、必要な値をたとえばＲＡＭ１４から順に読み出して、上述したように語義間関連度を計算する。制御部１１は、計算した語義間関連度をたとえばＲＡＭ１４に順に記憶させる。
【０１１０】
このように計算された語義間関連度は、図１５に示すように、それぞれの語義と語義の間に定義される。この表においては、語義間関連度は０から１までの値をとるように正規化されている。この表においては“コンピュータ”、“テレビ”、“ＶＴＲ”の間の相互の語義間関連度が示されている。“コンピュータ”と“テレビ”の語義間関連度は０．５５、“コンピュータ”と“ＶＴＲ”の語義間関連度は０．２５、“テレビ”と“ＶＴＲ”の語義間関連度は０．６０である。
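The ratio computed in steps S64 to S66 can be sketched as a perturbation test on the sense network. The sketch assumes a caller-supplied function `spread` that maps initial center activation values to values after spreading, and it measures both differences after spreading; the text does not state whether Δe_i is taken before or after spreading, so that choice, like the names `sense_relatedness` and `delta`, is an assumption. Normalization to the 0 to 1 range of FIG. 15 is not shown.

```python
from typing import Callable, List, Sequence


def sense_relatedness(initial: Sequence[float],
                      spread: Callable[[Sequence[float]], List[float]],
                      i: int, j: int, delta: float = 1.0) -> float:
    """Relatedness of sense s_i to sense s_j as the quotient of activation changes."""
    base = spread(initial)
    bumped_input = list(initial)
    bumped_input[i] += delta            # S64: change the initial value of e_i
    bumped = spread(bumped_input)
    d_i = bumped[i] - base[i]           # resulting change in e_i
    d_j = bumped[j] - base[j]           # S65: corresponding change in e_j
    return d_j / d_i                    # S66: quotient of the two changes
```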
【０１１１】
（３）実関心度および予測関心度
次に、図４のステップＳ１３の詳細について、図１６を参照して説明する。この処理をおこなうことで実関心度が検出される。
【０１１２】
ステップＳ１０１では、ユーザは、図６に示す文書分類ウィンドウ３０１から所望の文書を選択する。たとえば、ユーザは、文書分類ウィンドウ３０１の分類表示部に表示された文書のタイトルに対応するアイコンを、入力部２０のマウスにて選択する。そして、操作ボタン３０２の“ブラウザ（browser）”のボタンを選択することにより、次のステップＳ１０２の表示のステップに進む。
【０１１３】
ステップＳ１０２では、文書処理装置の制御部１１は、ステップＳ１０１においてユーザが選択した文書を、たとえばＲＡＭ１４から読み出す。制御部１１は、表示部３０において、読み出した文書をウィンドウ５１の文書表示部５３に表示する。上述したように、ウィンドウ５１の文書表示部５３に文書が全部表示できないときには、その文書の一部が表示される。
【０１１４】
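In step S103, the user reads the document displayed in the document display section 53 of window 51 in step S102 and creates a summary of it. That is, the user reads the document in the document display section 53. By selecting the "Summarize" button among the operation buttons 56 of window 51, the user also has a summary of the document shown in the document display section 53 displayed in the summary display section 54.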
【０１１５】
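The following describes, with reference to the flowchart of FIG. 17, the procedure by which user operations raise the importance of elements that the user selects in the document shown in the document display section 53 when a summary is created and displayed in the summary display section 54.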
【０１１６】
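In the first step S91, the control unit 11 determines whether an element in the document has been selected by the user. The selection is made through the graphical user interface (GUI) shown in FIG. 18, which accepts input from the user.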
【０１１７】
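Window 51 has a file name display section 52 that shows the file name of a document, a document display section 53 that shows the document whose file name appears in the file name display section 52, and a summary display section 54 that shows a summary of the document in the document display section 53. The document display section 53 shows all or part of the document whose file name or opening portion appears in the file name display section 52. When only part of the document is shown, the whole document can be read in sequence by, for example, scrolling the document display section 53. The summary display section 54 shows a summary of the document in the document display section 53, produced by the processing described later and sized to fit the summary display section 54. At this point the summary display section 54 is blank because no summary has been created yet. The sizes of the document display section 53 and the summary display section 54 can each be changed. The documents handled in window 51 are, for example, those received by the receiving unit 21 of the document processing apparatus and recorded by the recording/reproducing unit 31 or in the RAM 14.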
【０１１８】
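Window 51 also has a keyword input section 55 for entering keywords and a button section 56 containing several buttons. Entering a keyword in the keyword input section 55 raises the importance of those words shown in the document display section 53 that are highly related to the keyword. The button section 56 provides an "Undo" button, which reverts the result of an operation, and a "Summarize" button, which summarizes the text shown in the document display section 53 and displays the result in the summary display section 54. Selecting the "Summarize" button after, for example, the summary display section 54 has been resized generates a summary of the document in the document display section 53 that fits the new size, and the generated summary is displayed in the summary display section 54.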
【０１１９】
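In step S91 of FIG. 17, the control unit 11 determines whether the user has selected an element in the text shown in the document display section 53 of window 51 on the display unit 30 of the document processing apparatus. As the input unit 20 for selecting an element in the document display section 53, a pointing device can be used to move a cursor shown on the display unit 30. For example, when a mouse is used as the pointing device, the user moves the cursor onto the desired element in the document display section 53 and clicks to select it. When an element is selected in the document display section 53, it is, for example, highlighted so that the selection is clearly visible. In FIG. 19, the lexical element "mainframe" 57, the smallest element selected, is highlighted in the document display section 53 of window 51. The summary display section 54 is blank because no summary has been created yet. When an element has been selected in this way, the control unit 11 answers "YES" and advances to the next step S92. When no element is selected, for example because no input arrives within a predetermined time or the mouse is clicked outside the displayed text in the document display section 53, the control unit 11 answers "NO", returns to step S91, and waits for an element to be input. For convenience, the description below assumes that a mouse is used as the pointing device of the input unit 20.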
【０１２０】
In step S92, the control unit 11 of the document processing apparatus determines whether the element selected in step S91 is one that was previously selected by a mouse click. If it is, the result is "YES" and processing proceeds to step S93. If it is not, the result is "NO" and processing proceeds to step S94.
【０１２１】
ステップＳ９３では、文書処理装置の制御部１１は、選択されているエレメントが、文章エレメントであるか否かを判別する。制御部１１は、レベルが文章エレメントであるときには“ＹＥＳ”として処理をステップＳ９１に戻す。制御部１１は、レベルが文章エレメントでないときには“ＮＯ”として処理を次のステップＳ９５に進める。
【０１２２】
ステップＳ９４では、文書処理装置の制御部１１は、レベルを、文書の最小のエレメントであって文書のタグ付けによる内部構造の最下位のエレメントである語彙エレメントに設定する。そして、制御部１１は、処理をステップＳ９１に戻す。
【０１２３】
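In step S95, the control unit 11 of the document processing apparatus raises the level by one. For the lexical element "mainframe" 57 selected in step S91, for example, raising the level by one selects the next larger enclosing element, "Big mainframe computers" 59, as shown in FIG. 20, and this portion is highlighted. At the same time, the control unit 11 raises the weight, that is, the center activation value, of the selected higher-level element above that of unselected elements. The control unit 11 then returns to step S91.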
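The level logic of steps S92 to S95 can be pictured as walking up a list of enclosing elements on repeated clicks. The sketch below is only an illustration: the function name `next_level` and the idea of passing the chain of enclosing elements as a list (`path`) are assumptions, not part of the patent.

```python
from typing import Sequence


def next_level(path: Sequence[str], level: int, clicked_before: bool) -> int:
    """Return the index into `path` of the element to highlight.

    `path` lists the enclosing elements from the lexical element (index 0)
    up to the sentence (last index).
    """
    if not clicked_before:
        return 0                    # S94: start at the lexical element
    if level == len(path) - 1:
        return level                # S93 "YES": already at sentence level
    return level + 1                # S95: move to the next larger element
```

For example, with `path = ["mainframe", "Big mainframe computers", "<sentence>"]`, a first click gives level 0 and a second click on the same element gives level 1, the phrase highlighted in FIG. 20.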
【０１２４】
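When the "Summarize" button in the button section 56 of window 51 is selected by a mouse click, a summary of the text shown in the document display section 53 is displayed in the summary display section 54. When the button is selected, the control unit 11 leaves the series of steps shown in FIG. 17 by interrupt and starts the summary creation process. The summary is generated from the document in the document display section 53 to match the size of the summary display section 54 and fill its area. As shown in FIG. 21, the summary in the summary display section 54 contains the element "Big mainframe computers" 60, which corresponds to the element "Big mainframe computers" 59 highlighted in the document display section 53. Selecting a desired element in the document display section 53 and raising its importance in this way makes the element more likely to be included in the summary. Summary generation is described in more detail later.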
【０１２５】
図１８に示したウィンドウ５１においては、文書表示部５３に表示された文書中のエレメントの選択はマウスによるクリック以外にも、キーワード入力部５５にキーワードを入力することによって選択することができる。制御部１１は、このようにキーワード入力部５５に入力されたキーワードに関連するエレメントの重要度を上げる処理を行う。キーワードとエレメントの関連度は、たとえばＲＯＭ１５に記録されたテーブルを参照することにより得る。この参照は、キーワードが含まれるエレメントをタグ付けによって参照することによりおこなわれる。
【０１２６】
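In step S104 of FIG. 16, the control unit 11 of the document processing apparatus computes the user's actual interest level in the document. The actual interest level is computed from the user's operations, in step S103, on the document displayed in window 51.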
【０１２７】
ここで、本実施の形態に用いられる実関心度と予測関心度について説明する。実関心度とは、このステップＳ１０４で演算されるものであって、ユーザの操作により検出される、ユーザが操作した文書に対する実際の関心度である。これに対して、予測関心度とは、ユーザの文書に対する関心度を予測したものである。この予測関心度は、たとえば実関心度に基づいて予測される。
【０１２８】
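In step S105, the control unit 11 records the user's operation history in the index. In the index example given above, the user's operation history was illustrated as
＜ユーザの操作履歴最大要約サイズ＝“100”＞
＜選択エレメントの数＝“10”＞ピクチャーテル＜／選択＞
・・・
＜／ユーザの操作履歴＞
In step S105, the control unit 11 updates operation-history items such as the maximum summary size, the selected elements and the number of selected elements. The control unit 11 stores the updated index in, for example, the RAM 14.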
【０１２９】
なお、インデックスには文書の実関心度を含めておくこともできる。たとえば、カテゴリごとに各文書に対する実関心度をインデックスに含めることができる。このような場合には、ステップＳ１０５において、その文書に関するインデックスに含まれる実関心度自体も更新される。
【０１３０】
Next, the user's operations in step S103 of FIG. 16 are described with reference to FIGS. 22, 23 and 24.
【０１３１】
文書分類ウィンドウ３０１にタイトルが表示された文書は、たとえば、入力部２０のマウスを用いて表示部３０において選択することにより、表示部３０に表示させることができる。このように文書を表示する文書表示ウィンドウの具体例は、図１８に示したので、ここでの説明を省略する。
【０１３２】
続いて、要約を作成する処理の図４に示すものより詳細の制御を含む例について図２２に示すフローチャートを参照して詳細に説明する。この一連の工程は、“要約”ボタン１０３をオンすることによって開始される。
【０１３３】
文書から要約を作成する処理は、文書のタグ付けによる内部構造に基づいて実行される。上述したように、ウィンドウ１００において要約を表示する表示領域１３０のサイズは変更することができる。文書処理装置の制御部１１は、新たにウィンドウ１０１が表示部３０のウィンドウ１００に描画されるか表示領域１３０のサイズが変更され、実行ボタン１０３が操作されたときには、表示領域１３０に適合するようにウィンドウ１００の表示領域１２０に表示されている文書から要約を作成する処理を実行する。
【０１３４】
図２２の最初のステップＳ１２０では、文書処理装置の制御部１１は、活性拡散を行う。本実施の形態においては、活性拡散により得られた中心活性値を重要度として採用することにより、文書の要約を行う。すなわち、タグ付けによる内部構造を与えられた文書においては、活性拡散と呼ばれる処理を行うことにより、各エレメントにタグ付けによる内部構造に応じた中心活性値を付与することができる。活性拡散は、中心活性値の高いエレメントと関わりのあるエレメントにも高い中心活性値を与えるような処理である。すなわち、活性拡散は、照応(共参照)表現とその先行詞の間で中心活性値が等しくなり、それ以外では中心活性値が減衰するような中心活性値についての演算である。この中心活性値は、タグ付けによる内部構造に応じて決定されるので、タグ付けによる内部構造を考慮した文書の分析に利用することができる。
【０１３５】
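In step S121, the control unit 11 of the document processing apparatus sets w_s to the size of the summary display section 54 of window 51 shown on the display unit 30, specifically the maximum number of characters that the summary display section 54 can show. The control unit 11 also initializes the string s that holds the summary, setting the initial value s₀ = "". The control unit 11 records the maximum character count w_s and the initial value s₀ of the summary string in, for example, the RAM 14.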
【０１３６】
ステップＳ１２２では、文書処理装置の制御部１１は、要約の骨格の順次の作成をカウントするカウンタのカウント値ｉを零に設定する。すなわち、制御部１１は、カウント値について、ｉ＝０と設定する。制御部１１は、このように設定したカウント値ｉをたとえばＲＡＭ１４に記録する。
【０１３７】
ステップＳ１２３では、文書処理装置の制御部１１は、カウンタのカウント値ｉについて、文章からｉ番目に平均中心活性値の高い文の骨格を抽出する。平均中心活性値とは、一つの文を構成する各エレメントの中心活性値を平均したものである。制御部１１は、たとえばＲＡＭ１４に記録した要約を格納するｓ_i-1を読み出し、このｓ_i-1に対して抽出した文の骨格の文字列を加えて、ｓ_iとする。そして、制御部１１は、このようにして得たｓ_iを、たとえばＲＡＭ１４に記録する。同時に、制御部１１は、上記文の骨格に含まれないエレメントの中心活性値順のリストｌ_iを作成し、このリストｌ_iをたとえばＲＡＭ１４に記録する。
【０１３８】
すなわち、このステップＳ１２３においては、要約のアルゴリズムは、活性拡散の結果を用いて、平均中心活性値の大きい順に文を選択し、選択された文の骨格の抽出する。文の骨格は、文から抽出した必須要素により構成される。必須要素になりうるのは、エレメントの主辞(head)と、主語(subject)、目的語(object)、間接目的語(indirect object)、所有者(possessor)、原因(cause)、条件(condition)または比較(comparison)の関係属性を有する要素と、等位構造が必須要素のときにはそれに直接含まれるエレメントとが必須要素を構成するものである。そして、文の必須要素をつなげて文の骨格を生成し、要約に加える。
【０１３９】
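In step S124, the control unit 11 of the document processing apparatus determines whether the length of s_i exceeds the maximum character count w_s of the summary display section 54 of window 51. If it does, the result is "YES" and processing proceeds to step S129. If it does not, the result is "NO" and processing proceeds to step S125. In other words, processing ends once the summary has reached the specified length; if room remains, the center activation value of the sentence with the next highest value is compared with that of the omitted elements, and whichever is higher is added to the summary.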
【０１４０】
ステップＳ１２９では、文書処理装置の制御部１１は、ステップＳ１２４でｓ_iの長さが最大文字数ｗ_sより大きいと判断されたので、要約をｓ_i-1に設定する。この場合、要約はウィンドウにおさまらないのでｓ_i＝ｓ₀＝“”を出力する。したがって、このときには要約は表示されないこととなる。そして、制御部１１は、この一連の工程を終了する。
【０１４１】
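In step S125, the control unit 11 of the document processing apparatus compares the center activation value of the sentence with the (i+1)-th highest average center activation value against the center activation value of the highest-valued element in the list l_i created in step S123. If the sentence's value is higher, the result is "YES" and processing proceeds to the next step, S126. If it is not higher, the result is "NO" and processing proceeds to step S127.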
【０１４２】
ステップＳ１２６では、文書処理装置の制御部１１は、カウンタのカウント値ｉを１だけ増加させる。そして、制御部１１は、処理をステップＳ１２３に戻す。
【０１４３】
ステップＳ１２７においては、文書処理装置の制御部１１は、リストｌ_iの最も中心活性値の高い要素ｅをｓ_iに加えてｓｓ_iを生成する。要素ｅをｌ_iから削除する。そして、制御部１１は、このようにして生成したｓｓ_iをたとえばＲＡＭ１４に記録する。
【０１４４】
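In step S128, the control unit 11 of the document processing apparatus determines whether the length of ss_i exceeds the maximum character count w_s of the summary display section 54 of window 51. If it does, the result is "YES" and processing proceeds to step S130. If it does not, the result is "NO" and processing returns to step S125.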
【０１４５】
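In step S130, because the length of ss_i was found in step S128 to exceed the maximum character count w_s, the control unit 11 of the document processing apparatus sets the summary to s_i. The summary is thus generated so as not to exceed w_s characters. The control unit 11 then ends this series of steps.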
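The summary loop of FIG. 22 (steps S121 to S130) can be sketched as a greedy fill of the summary area. This is a simplified reading under stated assumptions: each sentence arrives as a hypothetical tuple `(average activation, skeleton text, extras)`, where `extras` lists `(activation, text)` pairs for elements left out of the skeleton; omitted elements are appended to the end of the summary rather than restored to their original position; and an accepted element stays in the summary for later steps.

```python
from typing import List, Sequence, Tuple

Sentence = Tuple[float, str, Sequence[Tuple[float, str]]]


def summarize(sentences: Sequence[Sentence], max_chars: int) -> str:
    """Build a summary of at most max_chars characters (w_s in the text)."""
    ranked = sorted(sentences, key=lambda s: s[0], reverse=True)
    summary = ""                                   # S121: s_0 = ""
    leftovers: List[Tuple[float, str]] = []
    i = 0                                          # S122
    while i < len(ranked):
        _, skeleton, extras = ranked[i]
        candidate = summary + skeleton             # S123: add the i-th skeleton
        if len(candidate) > max_chars:             # S124 "YES" -> S129
            return summary
        summary = candidate
        leftovers = sorted(leftovers + list(extras), reverse=True)
        while leftovers:                           # S125: next sentence vs best leftover
            next_avg = ranked[i + 1][0] if i + 1 < len(ranked) else float("-inf")
            if next_avg > leftovers[0][0]:
                break                              # S126: take the next sentence instead
            extended = summary + leftovers[0][1]   # S127: ss_i = s_i + e
            if len(extended) > max_chars:          # S128 "YES" -> S130
                return summary
            summary = extended
            leftovers.pop(0)
        i += 1
    return summary
```

Because the loop checks the length before committing each addition, resizing the summary area and running it again produces a summary that fits the new size, as the "Summarize" button does in the text.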
【０１４６】
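Window 51 also has a keyword input section 55 for entering keywords and a button section 56 containing several buttons. Entering a keyword in the keyword input section 55 raises the actual interest level of those words shown in the document display section 53 that have a high inter-sense relatedness to the keyword. The button section 56 provides an "Undo" button, which reverts the result of an operation, and a "Summarize" button, which summarizes the text shown in the document display section 53 and displays the result in the summary display section 54. Selecting the "Summarize" button after, for example, the summary display section 54 has been resized generates a summary of the document in the document display section 53 that fits the new size, and the generated summary is displayed in the summary display section 54.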
【０１４７】
文書に対するユーザの実関心度は、次のような複数の要素に基づいて演算される。なお、実関心度についての要素は、文書を構成する要素とは、異なるものである。
【０１４８】
実関心度の演算では、ユーザによって指定されたエレメントのうち、文書中での出現位置が文書の先頭から最も離れたものの位置を第１の要素Ａ（Ｄ_i）とする。この第１の要素によると、ユーザによって指定されたエレメントのうち、文書中での出現位置が文書の先頭から最も離れたものの位置が大きいほど、ユーザがその文書をより多く読んだと考え、その文書への実関心度も大きいこととする。具体的には、選択されたエレメントの最大出現位置と文書全体のサイズの比率を実関心度の第１の要素Ａ（Ｄ_i）とする。ここで、Ｄ_iは第ｉ番目の文書を表している。
【０１４９】
図２３に示すウィンドウ５１の文書表示部５３においては、第１のエレメント５７、第２のエレメント５８および第３のエレメント５９がユーザによって指定され、ハイライト表示されている。実関心度の計算には、これらのうちで文書の先頭から最も離れた第３のエレメント５９が用いられる。
【０１５０】
また、実関心度の演算では、ウィンドウ５１の文書表示部５３に表示された文書のエレメントからユーザが選択したものの数や、キーワード入力部５５にユーザが入力したキーワードの数を第２の要素Ｅ（Ｄ_i）とする。
【０１５１】
図２３に示すウィンドウ５１の文書表示部５３においては、第１のエレメント５７、第２のエレメント５８および第３のエレメント５９の指定がユーザにより入力されている。また、キーワード入力部５５には、キーワード“ＡＡＡ”が入力されている。これらエレメントおよびキーワードの入力の数を実関心度の第２の要素Ｅ（Ｄ_i）とする。
【０１５２】
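Furthermore, in computing the actual interest level, the ratio of the size of the summary display section 54 to the size of the document display section 53 in window 51 is taken as the third factor W(D_i). The summary is displayed to fit the summary display section 54, and the higher the user's actual interest, the more likely the user is to want a detailed, that is longer, summary rather than a brief one. The actual interest level can therefore be regarded as higher the larger this ratio becomes.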
【０１５３】
図２４に示すウィンドウ５１においては、要約を表示する要約表示部５４の最大のサイズの、文書の全部を表示した文書表示部５３のサイズに対する比率を実関心度の第３の要素Ｗ（Ｄ_i）とする。
【０１５４】
実関心度の第１の要素Ａ（Ｄ_i）、実関心度の第２の要素Ｅ（Ｄ_i）および実関心度の第３の要素Ｗ（Ｄ_i）に基づいて、ユーザの文書Ｄ_iに対する実関心度ＩＲ（Ｄｉ）は
ＩＲ（Ｄｉ）＝ｌ₂Ｗ（Ｄｉ）＋ｍ₂Ａ（Ｄｉ）＋ｎ₂Ｅ（Ｄｉ）
と定義される。ここで、係数ｌ₂、ｍ₂、ｎ₂は定数で、それぞれの値の実関心度への寄与を表すものである。なお、これらの係数ｌ₂、ｍ₂、ｎ₂の値としては、ｌ₂＝ｍ₂＝１０、ｎ₂＝１とすることができる。また、係数ｌ₂，ｍ₂，ｎ₂の値は、統計的手法を使って推定することもできる。すなわち、制御部１１は、複数の係数ｌ₂、ｍ₂、ｎ₂の組について実関心度ＩＲ（Ｄｉ）が与えられると、上記係数を最適化により求めることができる。
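As an illustration of this definition, the following Python sketch evaluates IR(D_i) with the example coefficients l₂ = m₂ = 10 and n₂ = 1. The function and parameter names are hypothetical, and treating E(D_i) as the simple sum of the selected-element count and the keyword count is an assumption; the text only says that both counts make up the second factor.

```python
def actual_interest(last_selected_pos, doc_size,
                    n_selected, n_keywords,
                    summary_area, document_area,
                    l2=10.0, m2=10.0, n2=1.0):
    """Sketch of IR(D_i) = l2*W(D_i) + m2*A(D_i) + n2*E(D_i)."""
    # A: position of the farthest selected element relative to document size.
    a = last_selected_pos / doc_size
    # E: number of selected elements plus number of keywords (assumed sum).
    e = n_selected + n_keywords
    # W: summary display area relative to document display area.
    w = summary_area / document_area
    return l2 * w + m2 * a + n2 * e


# Three selected elements and one keyword, as in the FIG. 23 description;
# the positions and areas are made-up illustration values.
print(actual_interest(800, 1000, 3, 1, 200, 1000))
```

With these illustrative inputs the three terms are 2, 8, and 4, which shows how the example coefficients put the two ratios on a scale comparable to the raw count.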
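[0155]
Next, the reordering of documents based on the predicted interest level, which is obtained using the actual interest level, will be described with reference to FIG. 25. This reordering is performed while the browser of FIG. 6 is open.
[0156]
In step S111, the control unit 11 of the document processing apparatus sets to 0 the count value C of a counter that counts the categories into which documents are classified. In step S112, the control unit 11 computes inter-document relevance. That is, among the documents that were classified in step S23 of FIG. 8 but are still unread, the control unit 11 takes each unread document in the category indicated by the count value C and computes its inter-document relevance to each document in that category that has already been given an actual interest level. As described above, the actual interest level is given through the user's operations. The inter-document relevance is computed from the index described above. The details of this computation are described later.
[0157]
In step S113, the control unit 11 of the document processing apparatus computes the predicted interest level. The predicted interest level is computed from the inter-document relevance between the document in question and the documents that have already been given an actual interest level. The predicted interest level is therefore computed for documents that have not been given an actual interest level.
[0158]
For one unread document in the category, the control unit 11 selects the other document in that category that has the largest of the inter-document relevance values computed in step S112. The control unit 11 takes the actual interest level of the selected document as the predicted interest level of the unread document. The control unit 11 stores the predicted interest level obtained in this way in, for example, the RAM 14.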
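The selection rule for the predicted interest level can be sketched in Python as a nearest-neighbor lookup. The dictionary of actual interest levels and the `relevance` callback are assumed representations introduced only for illustration; they are not data structures described in the patent.

```python
def predicted_interest(unread_id, actual, relevance):
    """Return the actual interest level of the most related rated document.

    actual:    dict mapping document id -> actual interest level (same category)
    relevance: function (doc_a, doc_b) -> inter-document relevance
    """
    if not actual:
        return None  # no rated document in this category yet
    # Pick the rated document with the largest inter-document relevance.
    most_related = max(actual, key=lambda doc_id: relevance(unread_id, doc_id))
    return actual[most_related]


rel_table = {("u", "a"): 3.0, ("u", "b"): 12.5}
rated = {"a": 14.0, "b": 6.0}
print(predicted_interest("u", rated, lambda x, y: rel_table[(x, y)]))
```

Note that the unread document inherits the interest level of its closest neighbor unchanged; the relevance value itself only decides which neighbor is used.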
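[0159]
In step S118, the control unit 11 of the document processing apparatus branches depending on whether the computation of the predicted interest level has finished for all documents in the category. When the computation has finished for all documents in the category, the control unit 11 takes the "YES" branch and advances the processing to step S114; otherwise it takes the "NO" branch and returns the processing to step S112.
[0160]
In step S114, the control unit 11 of the document processing apparatus reorders the unread documents in each category based on the predicted interest levels computed in step S113. One way to reorder the documents is to give a higher priority to unread documents with a higher predicted interest level and to arrange the titles so that unread documents with a higher priority appear nearer the head of the list of unread document titles. When there is no significant difference in priority, the document received more recently is placed higher. The document titles are arranged in this order for each category in, for example, the classification display sections 303, 304, and 305 of the document classification window 301.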
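The ordering rule of step S114 can be sketched in Python as a sort on two keys. The dictionary fields are hypothetical, and because the text does not say what counts as a "significant difference" in priority, this sketch simply treats exactly equal predicted interest levels as ties and breaks them by reception time.

```python
from datetime import datetime


def order_unread(docs):
    """Sort unread documents: higher predicted interest first, newer first on ties.

    docs: list of dicts with 'title', 'predicted_interest', 'received' (assumed fields)
    """
    return sorted(
        docs,
        key=lambda d: (d["predicted_interest"], d["received"]),
        reverse=True,
    )


unread = [
    {"title": "a", "predicted_interest": 5.0, "received": datetime(1999, 4, 1)},
    {"title": "b", "predicted_interest": 14.0, "received": datetime(1999, 4, 2)},
    {"title": "c", "predicted_interest": 5.0, "received": datetime(1999, 4, 3)},
]
print([d["title"] for d in order_unread(unread)])
```

A tolerance-based notion of "no significant difference" would need a bucketing step before sorting, since a pairwise tolerance does not by itself define a consistent ordering.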
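[0161]
In step S115, the control unit 11 of the document processing apparatus determines whether all categories have been processed. When all categories have been processed, the control unit 11 takes the "YES" branch and advances the processing to step S117. When not all categories have been processed, the control unit 11 takes the "NO" branch and advances the processing to step S116.
[0162]
In step S116, the control unit 11 of the document processing apparatus increments by 1 the counter value C that counts the categories; that is, the control unit 11 sets C = C + 1. The control unit 11 then returns the processing to step S112. In step S117, because it was determined in step S115 that processing has finished for all categories, the control unit 11 displays the reordered documents. Specifically, as shown in FIG. 6, the icons and titles of the documents are displayed. When a document has no title, a one-sentence summary is displayed. This series of steps then ends.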
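[0163]
Next, the computation of the inter-document relevance in step S112 of FIG. 25 will be described in detail with reference to FIG. 26. The inter-document relevance is the degree of relevance between one document D_i and another document D_j.
[0164]
In step S41, the control unit 11 of the document processing apparatus takes the set of proper nouns contained in the index of one document D_i and the set of proper nouns contained in the index of another document D_j that has already been classified into the category specified in step S111 or S116 of FIG. 25, and sets P(D_i, D_j) to the number of members of their intersection. The control unit 11 then stores the number P(D_i, D_j) calculated in this way in, for example, the RAM 14.
[0165]
In step S42, the control unit 11 of the document processing apparatus refers to the table of word-meaning relevance shown in FIG. 15 and computes the sum R(D_i, D_j) of the word-meaning relevance between the meanings contained in the index of one unread document D_i and the meanings contained in the index of the other document D_j.
[0166]
In step S42, for the words of the unread document D_i other than proper nouns, the control unit 11 of the document processing apparatus refers to the table of word-meaning relevance and computes the sum R(D_i, D_j) of the word-meaning relevance to the other document D_j. The control unit 11 then stores the computed sum R(D_i, D_j) in, for example, the RAM 14.
[0167]
In step S43, the control unit 11 of the document processing apparatus defines the inter-document relevance of the other document D_j to the document D_i as
Rel(D_i, D_j) = m₃P(D_i, D_j) + n₃R(D_i, D_j)
Here, the coefficients m₃ and n₃ are constants that represent the degree to which each value contributes to the inter-document relevance. The control unit 11 reads the number P(D_i, D_j) of common members calculated in step S41 and the sum R(D_i, D_j) of word-meaning relevance calculated in step S42 from, for example, the RAM 14, and applies them to the above formula to calculate the inter-document relevance Rel(D_i, D_j). For example, m₃ = 10 and n₃ = 1 can be used as the values of these coefficients.
[0168]
The values of the coefficients m₃ and n₃ can also be estimated by statistical methods. That is, given the inter-document relevance Rel(D_i, D_j) for multiple pairs of the coefficients m₃ and n₃, the control unit 11 can obtain the coefficients by optimization.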
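Steps S41 to S43 can be sketched in Python as follows. The index layout (a set of proper nouns and a set of meaning identifiers per document) and the pair-keyed relevance table are assumptions made for illustration; the actual table of FIG. 15 is not reproduced here, and the function name is hypothetical.

```python
def inter_document_relevance(index_i, index_j, meaning_relevance,
                             m3=10.0, n3=1.0):
    """Sketch of Rel(D_i, D_j) = m3*P(D_i, D_j) + n3*R(D_i, D_j).

    index_*:           dict with 'proper_nouns' and 'meanings' sets (assumed layout)
    meaning_relevance: dict mapping (meaning_a, meaning_b) -> relevance value
    """
    # Step S41: number of proper nouns the two indexes have in common.
    p = len(index_i["proper_nouns"] & index_j["proper_nouns"])

    # Step S42: sum of word-meaning relevance over all pairs of meanings.
    def lookup(a, b):
        # The table is assumed symmetric; missing pairs count as 0.
        return meaning_relevance.get((a, b), meaning_relevance.get((b, a), 0.0))

    r = sum(lookup(a, b) for a in index_i["meanings"] for b in index_j["meanings"])

    # Step S43: weighted combination.
    return m3 * p + n3 * r


idx_i = {"proper_nouns": {"Mr. A", "C City"}, "meanings": {"0003", "0105"}}
idx_j = {"proper_nouns": {"C City", "Group B"}, "meanings": {"0105", "9582"}}
table = {("0003", "9582"): 0.5, ("0105", "0105"): 0.25}
print(inter_document_relevance(idx_i, idx_j, table))
```

With the example coefficients, a single shared proper noun outweighs a fairly large amount of accumulated word-meaning relevance, which matches the stated intent that the coefficients express each term's contribution.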
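[0169]
Next, the recording medium 32 that is recorded and reproduced by the recording / reproducing unit 31 of the document processing apparatus will be described. The recording medium stores a document processing program for processing documents that consist of a plurality of elements and have an internal structure given by tagging. As the recording medium 32, for example, a floppy disk on which information can be recorded and reproduced is used.
[0170]
The recording medium 32 holds an actual interest level detection process, which detects the actual interest level in a document, and a priority setting process, which sets priorities for the documents based on the actual interest level detected by the actual interest level detection process. The recording medium 32 further holds a display process, which displays a document, and an input process, which accepts manual input concerning the document displayed by the display process; the actual interest level detection process detects the actual interest level based on that input.
[0171]
Although an example of a method of tagging documents has been shown in the present embodiment, the present invention is of course not limited to this tagging method. Also, although documents are sent from outside to the receiving unit 21 of the document processing apparatus in the present embodiment, the present invention is not limited to this. For example, the documents may be written in the ROM 15 of the document processing apparatus, or may be read from the recording medium 32 by the recording / reproducing unit 31.
[0172]
In the embodiment described above, a mouse was given as an example of a device for selecting a desired element from the document displayed on the display unit 30 of the document processing apparatus, but the present invention is of course not limited to this. Other devices such as a tablet or a light pen can be used to input elements to the document processing apparatus.
[0173]
Furthermore, although Japanese and English text was used as examples in the embodiment described above, the present invention is of course not limited to these languages.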
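[0174]
[Effects of the invention]
As described above, the present invention processes electronic documents: it detects the actual interest level in an electronic document and sets priorities for electronic documents based on the detected actual interest level. The present invention also displays an electronic document, accepts manual input concerning the displayed electronic document, and detects the actual interest level based on this input. By setting the priorities of electronic documents so that they reflect the user's actual interest level, the present invention thus serves the user's convenience.
[0175]
Furthermore, the present invention takes as the predicted interest level the actual interest level of the most relevant document among the electronic documents whose actual interest level has already been obtained, and sets priorities based on this predicted interest level. The present embodiment can therefore give a priority even to documents that have not been given an actual interest level.
[0176]
The present invention also classifies electronic documents into a plurality of categories and sets priorities for the electronic documents in each category. By setting priorities for each category, the present invention provides convenience to the user.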
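[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of a document processing apparatus to which the present embodiment is applied.
[FIG. 2] A diagram showing the internal structure of a document given by tagging.
[FIG. 3] A diagram showing a window that displays the internal structure of a document given by tagging.
[FIG. 4] A flowchart showing the operation of the document processing apparatus to which the present embodiment is applied.
[FIG. 5] A diagram showing the GUI for classifying documents, before the documents are classified.
[FIG. 6] A diagram showing the GUI for classifying documents.
[FIG. 7] A diagram showing a table of the classification model.
[FIG. 8] A flowchart for automatically classifying documents.
[FIG. 9] A flowchart for finding the features of a document and creating an index.
[FIG. 10] A flowchart showing spreading activation.
[FIG. 11] A diagram explaining the spreading activation processing.
[FIG. 12] A flowchart of the link processing of spreading activation.
[FIG. 13] A flowchart for computing the relevance between document classifications.
[FIG. 14] A flowchart of the calculation of word-meaning relevance.
[FIG. 15] A diagram showing a table of word-meaning relevance.
[FIG. 16] A flowchart for browsing documents and performing classification operations.
[FIG. 17] A flowchart showing a series of steps for raising the importance of an arbitrary part of a text.
[FIG. 18] A diagram showing the summary window.
[FIG. 19] A diagram showing a state in which a word is selected in the summary window.
[FIG. 20] A diagram showing a state in which the selected area in the summary window has been clicked again.
[FIG. 21] A diagram showing a state in which a summary is displayed in the summary window.
[FIG. 22] A diagram showing the summary creation processing in detail.
[FIG. 23] A diagram explaining the calculation of the actual interest level from the maximum appearance position of the selected elements.
[FIG. 24] A diagram explaining the calculation of the actual interest level from the ratio of the maximum size of the summary element to the whole document.
[FIG. 25] A flowchart for automatically classifying documents by predicted interest level.
[FIG. 26] A flowchart for computing inter-document relevance.
[Explanation of reference numerals]
10 main body, 11 control unit, 12 interface, 13 CPU, 20 input unit, 21 receiving unit, 30 display unit, 31 recording / reproducing unit
[0001]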
BACKGROUND OF THE INVENTION
The present invention relates to a document processing method and apparatus for processing an electronic document, and a recording medium on which a document processing program for processing the electronic document is recorded.
[0002]
[Prior art]
Conventionally, the WWW (World Wide Web) has been provided on the Internet as an application service that presents hypertext information in a window format.
[0003]
The WWW is a system that performs document processing for creating, publishing, or sharing a document, and shows a new style of document. However, from the viewpoint of practical use of documents, advanced document processing exceeding WWW, such as document classification and summarization based on document contents, is required. For such advanced document processing, mechanical processing of document contents is indispensable.
[0004]
However, mechanical processing of document contents is still difficult for the following reasons. First, HTML (Hyper Text Markup Language), which is a language for describing hypertext, defines the expression of a document but hardly specifies the contents of the document. Second, a hypertext network formed between documents is not always easy to use for the reader of the document to understand the content of the document. Third, in general, the author of the text writes without regard to the convenience of the reader, but the convenience of the reader of the document is not coordinated with the convenience of the author.
[0005]
As described above, the WWW is a system that shows a new form for documents. However, because it does not process documents mechanically, it cannot perform advanced document processing. In other words, advanced document processing requires that documents be processed mechanically.
[0006]
Therefore, with the goal of mechanical processing of documents, systems that support such processing have been developed based on the results of natural language research. As document processing based on natural language research, mechanical document processing has been proposed that uses tags attached to a document, on the premise that the author of the document or another party assigns attribute information about the document's internal structure, so-called tags.
[0007]
[Problems to be solved by the invention]
With the spread of computers and the progress of networking in recent years, there is a demand for more capable document processing, which creates, labels, and modifies text documents, in areas such as text processing and indexing that depends on document content. For example, document summarization and document classification that respond to the user's requests are desired.
[0008]
The present invention is proposed in view of the above circumstances, and relates to a document processing method and apparatus that calculate a user's interest level in a document, and to a recording medium on which a document processing program that calculates a user's interest level in a document is recorded.
[0009]
[Means for Solving the Problems]
In order to solve the above problems, a document processing method according to the present invention is a document processing method of a document processing apparatus that processes a plurality of electronic documents, and has: a receiving step in which receiving means receives a plurality of electronic documents; a recording step in which recording means records the plurality of electronic documents received in the receiving step; a display step in which display means displays an electronic document selected by the user from among the plurality of electronic documents recorded in the recording step, together with a summary of that electronic document; an input step in which input means inputs the user's operation information for the electronic document displayed in the display step; an actual interest level detection step in which actual interest level detection means calculates the user's actual interest level in the electronic document based on the user's operation information input in the input step for the electronic document displayed in the display step; a priority setting step in which priority setting means, for an electronic document whose actual interest level has not been calculated, takes as a predicted interest level the actual interest level of the electronic document that has the highest relevance, based on the internal structure of the electronic documents, among the electronic documents whose actual interest level was calculated in the actual interest level detection step, and sets a priority based on that predicted interest level; and a rearrangement step in which rearranging means rearranges the electronic documents not selected by the user, among the plurality of electronic documents recorded in the recording step, according to the priority set in the priority setting step. In the actual interest level detection step, for the display areas that respectively display the electronic document and the summary of the electronic document, the actual interest level is calculated using a first actual interest level element, which is the ratio of the appearance position of the element that, among the elements of the electronic document selected by the user, appears farthest from the beginning of the electronic document to the size of the electronic document; a second actual interest level element, which consists of the number of keywords specified by the user and the number of elements selected by the user; and a third actual interest level element, which is the ratio of the size of the display area of the summary of the electronic document displayed in the display step to the size of the display area of the electronic document.
[0010]
A document processing apparatus according to the present invention is a document processing apparatus that processes a plurality of electronic documents, and has: receiving means for receiving a plurality of electronic documents; recording means for recording the plurality of electronic documents received by the receiving means; display means for displaying an electronic document selected by the user from among the plurality of electronic documents recorded by the recording means, together with a summary of that electronic document; input means for inputting the user's operation information for the electronic document displayed by the display means; actual interest level detection means for calculating the user's actual interest level in the electronic document based on the user's operation information input by the input means for the electronic document displayed by the display means; priority setting means for taking, for an electronic document whose actual interest level has not been calculated, as a predicted interest level the actual interest level of the electronic document that has the highest relevance, based on the internal structure of the electronic documents, among the electronic documents whose actual interest level was calculated by the actual interest level detection means, and for setting a priority based on that predicted interest level; and rearranging means for rearranging the electronic documents not selected by the user, among the plurality of electronic documents recorded by the recording means, according to the priority set by the priority setting means. For the display areas that respectively display the electronic document and the summary of the electronic document, the actual interest level detection means calculates the actual interest level using a first actual interest level element, which is the ratio of the appearance position of the element that, among the elements of the electronic document selected by the user, appears farthest from the beginning of the electronic document to the size of the electronic document; a second actual interest level element, which consists of the number of keywords specified by the user and the number of elements selected by the user; and a third actual interest level element, which is the ratio of the size of the display area of the summary of the electronic document displayed by the display means to the size of the display area of the electronic document.
[0011]
A recording medium according to the present invention is a computer-readable recording medium on which is recorded a document processing program that causes a computer to execute document processing on a plurality of electronic documents. The document processing program causes the computer to execute: a receiving step in which receiving means receives a plurality of electronic documents; a recording step in which recording means records the plurality of electronic documents received in the receiving step; a display step in which display means displays an electronic document selected by the user from among the plurality of electronic documents recorded in the recording step, together with a summary of that electronic document; an input step in which input means inputs the user's operation information for the electronic document displayed in the display step; an actual interest level detection step in which actual interest level detection means calculates the user's actual interest level in the electronic document based on the user's operation information input in the input step for the electronic document displayed in the display step; a priority setting step in which priority setting means, for an electronic document whose actual interest level has not been calculated, takes as a predicted interest level the actual interest level of the electronic document that has the highest relevance, based on the internal structure of the electronic documents, among the electronic documents whose actual interest level was calculated in the actual interest level detection step, and sets a priority based on that predicted interest level; and a rearrangement step in which rearranging means rearranges the electronic documents not selected by the user, among the plurality of electronic documents recorded in the recording step, according to the priority set in the priority setting step. In the actual interest level detection step, for the display areas that respectively display the electronic document and the summary of the electronic document, the actual interest level is calculated using a first actual interest level element, which is the ratio of the appearance position of the element that, among the elements of the electronic document selected by the user, appears farthest from the beginning of the electronic document to the size of the electronic document; a second actual interest level element, which consists of the number of keywords specified by the user and the number of elements selected by the user; and a third actual interest level element, which is the ratio of the size of the display area of the summary of the electronic document displayed in the display step to the size of the display area of the electronic document.
[0012]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of a document processing method and apparatus and a recording medium according to the present invention will be described with reference to the drawings.
[0013]
As shown in FIG. 1, a document processing apparatus according to an embodiment of the present invention includes a main body 10 having a control unit 11 and an interface 12, an input unit 20 that receives input from a user and sends it to the main body 10, a receiving unit 21 that receives signals from outside and sends them to the main body 10, a display unit 30 that displays output from the main body 10, and a recording / reproducing unit 31 that records information on and reproduces information from a recording medium 32.
[0014]
The main body 10 has a control unit 11 and an interface 12 and constitutes the main part of the document processing apparatus. The control unit 11 includes a CPU 13 that executes processing in the document processing apparatus, a RAM 14 that is a volatile memory, and a ROM 15 that is a nonvolatile memory. The CPU 13 performs control for executing a program according to, for example, a procedure recorded in the ROM 15, temporarily storing data in the RAM 14 when necessary. The interface 12 is connected to the control unit 11, the input unit 20, the receiving unit 21, the display unit 30, and the recording / reproducing unit 31. Under the control of the control unit 11, the interface 12 adjusts the timing of data input from the input unit 20 and the receiving unit 21, data output to the display unit 30, and data exchange with the recording / reproducing unit 31, and converts data formats.
[0015]
The input unit 20 is the part that receives user input to the document processing apparatus. The input unit 20 consists of, for example, a keyboard and a mouse. Using the input unit 20, the user can enter keywords with the keyboard, or select and input elements of an electronic document displayed on the display unit 30 with the mouse. Hereinafter, an electronic document is simply referred to as a document. Here, an element is a unit that makes up a document, for example a document, a sentence, or a word.
[0016]
The receiving unit 21 is a part that receives a signal transmitted to the document processing apparatus from the outside via, for example, a communication line. The receiving unit 21 receives a plurality of documents transmitted from the outside. The receiving unit 21 sends the received data to the main body 10.
[0017]
The display unit 30 displays character and image output from the document processing apparatus. The display unit 30 consists of, for example, a cathode ray tube (CRT) or a liquid crystal display (LCD), and displays, for example, one or more windows, in which characters, figures, and the like are shown.
[0018]
The recording / reproducing unit 31 records / reproduces data on / from a recording medium 32 such as a so-called floppy disk. The recording medium 32 records a document processing program for processing a document. The recording medium 32 will be further described later.
[0019]
Next, documents in the present embodiment will be described. In the present embodiment, document processing is performed with reference to tags, which are attribute information attached to a document. The tags used in the present embodiment include syntactic tags, which indicate the structure of a document, and semantic / pragmatic tags, which allow the content of a document to be understood mechanically across multiple languages.
[0020]
Some syntactic tags describe the internal structure of a document. As shown in FIG. 2, the internal structure given by tagging is one in which elements such as documents, sentences, and vocabulary elements are connected to one another by normal links and by reference / referenced links. In the figure, a white circle "◯" indicates an element, and the lowest white circles are vocabulary elements, which correspond to words at the lowest level of the document. A solid line is a normal link, which indicates a connection between elements such as a document, a sentence, and a vocabulary element. A broken line is a reference link, which indicates a dependency by reference or by being referenced. From top to bottom, the internal structure of a document consists of the document, subdivisions, paragraphs, sentences, subsentential segments, ..., and vocabulary elements. Of these, subdivisions and paragraphs are optional.
[0021]
Semantic / pragmatic tagging, on the other hand, describes information such as meaning, for example which meaning of a word with multiple meanings is intended. Tagging in the present embodiment follows the XML (Extensible Markup Language) format, which resembles HTML (Hyper Text Markup Language).
[0022]
An example of tagging is shown below, but tagging of documents is not limited to this method. Moreover, although the example of an English and Japanese document is shown below, the description of the internal structure by tagging can be applied to other languages similarly.
[0023]
For example, the sentence "Time flies like an arrow." can be tagged as follows.
[0024]
<sentence><noun phrase meaning="time0">time</noun phrase>
<verb phrase><verb meaning="fly1">flies</verb>
<adjective verb phrase><adjective verb meaning="like0">like</adjective verb><noun phrase>an <noun meaning="arrow0">arrow</noun></noun phrase>
</adjective verb phrase></verb phrase>.</sentence>
Here, <sentence>, <noun>, <noun phrase>, <verb>, <verb phrase>, <adjective verb>, and <adjective verb phrase> represent syntactic structures of a sentence: respectively, a sentence, a noun, a noun phrase, a verb, a verb phrase, an adjective or adverb (including prepositional and postpositional phrases), and an adjective phrase or adverb phrase. A tag is placed immediately before the beginning of an element and another immediately after its end; the tag placed immediately after the end marks the end of the element with the symbol "/". Elements represent syntactic constituents, that is, phrases, clauses, and sentences. The attribute meaning = "time0" indicates the 0th of the several meanings of the word "time". Specifically, the word "time" has at least noun, adjective, and verb meanings, and here it is marked as the noun. Similarly, the word "orange" has at least the meanings of a plant name, a color, and a fruit, and these can also be distinguished by the meaning attribute.
[0025]
The syntactic structure of a document according to the present embodiment can be displayed in the window 101 of the display unit 30, as shown in FIG. 3. In this window 101, vocabulary elements are displayed in the right half 103, and the internal structure of the sentence is displayed in the left half 102.
[0026]
In this window 101, part of the following document, whose internal structure is described by tagging, is displayed: "In C City, where the meeting of Mr. A's Group B ended, some popular papers and general papers revealed on their pages a policy of self-regulating photo coverage of it." An example of tagging this document is shown below.
[0027]
<document><sentence><adjective verb phrase relation="position"><noun phrase><adjective verb phrase location="C City">
<adjective verb phrase relation="subject"><noun phrase identifier="Group B"><adjective verb phrase relation="affiliation"><person name identifier="Mr. A">Mr. A</person name>'s</adjective verb phrase><organization name identifier="Group B">Group B</organization name></noun phrase></adjective verb phrase>
ended</adjective verb phrase><place name identifier="C City">C City</place name></noun phrase>,</adjective verb phrase><adjective verb phrase relation="subject"><noun phrase identifier="press" syntax="parallel"><noun phrase><adjective verb phrase>some</adjective verb phrase> popular papers</noun phrase> and <noun>general papers</noun></noun phrase></adjective verb phrase>
<adjective verb phrase relation="object"><adjective verb phrase relation="content" subject="press"><adjective verb phrase relation="object"><noun phrase><adjective verb phrase><noun coreference="Group B">its</noun></adjective verb phrase> photo coverage</noun phrase></adjective verb phrase>
self-regulating</adjective verb phrase> policy</adjective verb phrase>
<adjective verb phrase relation="position">on their pages</adjective verb phrase>
revealed.</sentence></document>
[0028]
In this document, "some popular papers and general papers" is marked as parallel by the attribute syntax = "parallel". Parallel elements are defined as sharing a dependency relation. When nothing is specified, for example, <noun phrase relation = x><noun>A</noun><noun>B</noun></noun phrase> expresses that A depends on B. Here, relation = x is a relation attribute.
[0029]
Relation attributes describe interrelations in syntax, meaning, and rhetoric. Grammatical functions such as subject, object, and indirect object; thematic roles such as agent, patient, and beneficiary; and rhetorical relations such as reason and result are described by this relation attribute. In the present embodiment, relation attributes are described for relatively simple grammatical functions such as subject, object, and indirect object.
[0030]
In this document, the attributes of proper nouns such as “Mr. A”, “Group B”, and “C City” are described by tags such as place names, person names, and organization names. Words to which tags such as place names, person names, and organization names are given are proper nouns.
[0031]
The operation of the document processing apparatus as an embodiment according to the present invention will be described below. The document processing device detects the actual interest level with respect to the document, and sets a priority order for other documents based on the detected actual interest level. The document processing apparatus displays a document and detects an actual interest level based on the displayed document. The actual interest level is detected according to the user's operation on the document. Based on the degree of association with the actual interest level, a predicted interest level is defined for a document for which no actual interest level is given. When the predicted interest level is used, a priority order can be given to a document that is not operated by the user.
[0032]
Prior to describing the actual interest level, manual document classification and automatic document classification will be described. That is, the operation of the document processing apparatus will be described in the order of (1) manual document classification, (2) automatic document classification, (3) actual interest level and predicted interest level.
[0033]
The contents of the explanation are outlined here. (1) Under manual document classification, the operation in which the document processing apparatus receives documents sent from outside and the user classifies them manually is described. This manual classification creates a classification model for classifying documents. (2) Under automatic document classification, the operation of classifying documents using the relevance between document classifications, based on the classification model created by manual classification, is described. (3) Under actual interest level and predicted interest level, the processing performed based on the actual interest level, which is detected from the user's operations, and the predicted interest level, which is obtained from the actual interest level and the inter-document relevance, is described.
[0034]
(1) Manual classification of documents
In this embodiment, no classification model exists in the initial state. In the initial state, documents sent from outside must be classified manually in order to create a classification model. This manual classification operation of the document processing apparatus will be described with reference to FIG. 4.
[0035]
In step S11 of FIG. 4, the receiving unit 21 of the document processing apparatus receives a plurality of documents transmitted via a communication line, for example. The receiving unit 21 sends the received document to the main body 10 of the document processing apparatus.
[0036]
In step S12, the control unit 11 of the document processing apparatus extracts features of a plurality of documents sent from the receiving unit 21, and creates feature information, that is, an index of each document. The control unit 11 stores the received documents and the created index in, for example, the RAM 14. The index includes proper nouns, meanings other than proper nouns, and the like that are characteristic of the document.
[0037]
Here, a specific example of an index is shown.
[0038]
<index date="AAAA/BB/CC" time="DD:EE:FF" document address="1234">
<user operation history maximum summary size="100">
<select number of elements="10">Picturetel</select>
...
</user operation history>
<summary>Tax reduction scale, untouched - Prime Minister X's meeting</summary>
<word meaning="0003" central activity value="140.6">Do not touch</word>
<word meaning="0105" identifier="X" central activity value="67.2">Prime Minister</word>
<person name identifier="X" meaning="6103" central activity value="150.2">Prime Minister X</person name>
<word meaning="5301" central activity value="120.6">determined</word>
<word meaning="2350" identifier="X" central activity value="31.4">Prime Minister</word>
<word meaning="9582" central activity value="182.3">Emphasized</word>
<word meaning="2595" central activity value="93.6">touch</word>
<word meaning="9472" central activity value="12.0">Noticed</word>
<word meaning="4934" central activity value="46.7">I didn't touch</word>
<word meaning="0178" central activity value="175.7">explained</word>
<word meaning="7248" identifier="X" central activity value="130.6">I</word>
<word meaning="3684" identifier="X" central activity value="121.9">Prime Minister</word>
<word meaning="1824" central activity value="144.4">Appealed</word>
<word meaning="7289" central activity value="176.8">showed</word>
</index>
[0039]
In this index, <index> and </index> mark the beginning and end of the index, date and time give the date and time at which this index was created, and <summary> and </summary> mark the beginning and end of the summary of the contents of this index. <word> and </word> mark the beginning and end of a word. The attribute meaning = "0003" indicates the third meaning, and the same applies to the other values. That is, because the same word may have several meanings, a number is assigned to each meaning in advance in order to distinguish them. One or more meanings therefore exist for the same word.
[0040]
<user operation history> and </user operation history> mark the beginning and end of the user operation history, and <select> and </select> mark the beginning and end of a selected element. The attribute maximum summary size = "100" indicates that the maximum size of the summary is 100 characters, and number of elements = "10" indicates that 10 elements were selected.
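To make the shape of this index concrete, the following Python sketch models it as two small data classes. The class and field names are hypothetical choices that mirror the attributes in the example above; the patent does not prescribe any in-memory representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IndexEntry:
    """One word or proper-noun entry of the index (assumed representation)."""
    text: str
    meaning: str                      # predetermined meaning number, e.g. "0003"
    central_activity: float           # central activity value
    identifier: Optional[str] = None  # shared by entries that refer to the same entity
    is_proper_noun: bool = False      # True for person, place, or organization names


@dataclass
class DocumentIndex:
    """Index of one document, including the user operation history."""
    date: str
    time: str
    document_address: str
    summary: str
    max_summary_size: int
    selected_elements: List[str] = field(default_factory=list)
    entries: List[IndexEntry] = field(default_factory=list)

    def top_entries(self, n: int) -> List[IndexEntry]:
        # Entries with the highest central activity value come first.
        return sorted(self.entries, key=lambda e: e.central_activity, reverse=True)[:n]


index = DocumentIndex(
    date="AAAA/BB/CC", time="DD:EE:FF", document_address="1234",
    summary="Tax reduction scale, untouched - Prime Minister X's meeting",
    max_summary_size=100,
    selected_elements=["Picturetel"],
    entries=[
        IndexEntry("Emphasized", "9582", 182.3),
        IndexEntry("explained", "0178", 175.7),
        IndexEntry("showed", "7289", 176.8),
        IndexEntry("Prime Minister X", "6103", 150.2, identifier="X", is_proper_noun=True),
    ],
)
print([e.text for e in index.top_entries(2)])
```

Keeping the operation history next to the word entries reflects how the example index combines both in a single record per document.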
[0041]
In step S13 of FIG. 4, the user browses the documents displayed on the display unit 30 of the document processing apparatus, as in the specific display example of FIG. 5. In FIG. 5, documents not yet classified by the user are placed under "other topics", and their icons and titles are displayed under "other topics" in the first classification display section 303 of the window 301. The control unit 11 of the document processing apparatus controls the display unit 30 so that it displays the document the user wants from among the documents listed in this way. The control unit 11 selects the document to be displayed on the display unit 30 according to the user's input to the input unit 20. The display unit 30 displays the document selected by the user in a window whose area can be resized. When the whole document does not fit in this window, part of the document is displayed.
[0042]
Note that step S13, in which the user browses the document, is performed as the user requires. In the figure, a step drawn as a parallelogram, such as step S13, indicates an operation performed by the user. The same applies to the following.
[0043]
Here, a specific example of the display shown in FIG. 5 will be described in detail. In this specific example, a user can freely set or change a category for classifying documents. Such a category setting or change is manually performed by the user.
[0044]
A specific example of the graphical user interface (GUI) used to display the document classification on the display unit 30 is as shown in the figure. The document classification window 301 includes operation buttons 302 consisting of a position reset button for returning the window to its initial state, a browser button for calling a browser for browsing the contents of a document, and an exit button for leaving this window.
[0045]
The document classification window 301 also includes a first classification display unit 303 that displays the above-mentioned “other topics”, a second classification display unit 304 that displays “business news”, a third classification display unit 305 that displays “political news”, and so on. Each classification display unit shows the icons and titles of the documents classified into the corresponding category. If a document has no title, a one-sentence summary is displayed instead. The size of each classification display unit is not fixed and can be changed to a desired size by, for example, operating the mouse of the input unit 20. The title or label of each classification display unit can also be changed freely.
[0046]
In the “other topics” of the first classification display unit 303, for example, the titles of documents before being classified into categories corresponding to the second classification display unit 304 and the following are displayed. That is, in the manual classification process, the document received by the document processing apparatus is once displayed in “other topics” of the first classification display unit 303. The documents displayed on the first classification display unit 303 are classified into categories by the user as follows.
[0047]
In step S14 of FIG. 4, the user creates a classification model composed of a plurality of categories for classifying the plurality of documents viewed on the display unit 30 of the document processing apparatus in step S13. Then, the plurality of documents are classified into each category of the classification model.
[0048]
The classification model is composed of a plurality of classification items, that is, categories, for classifying documents. The category is composed of a category index including proper nouns, meanings other than proper nouns, document addresses included in the categories, and the like characteristic of the category. The category index is composed of a document index including proper nouns and meanings other than proper nouns.
[0049]
For example, the classification model shown in FIG. 7 has, for the category index of each category, columns for proper nouns, meanings other than proper nouns, and document addresses. The entries of each column are listed below in the same category order. The proper nouns are “Mr. A, ...”, “Mr. B, ...”, “C company, G company, ...”, “D type, ...”, “Mr. E, ...”, and “Mr. F”. The meanings are “baseball (4546), ground (2343), ...”, “labor (3112), unique (9821), ...”, “mobile (2102), ...”, “cherry 1 (11111), orange 1 (9911)”, “cherry 2 (11112), orange 2 (9912)”, and “cherry 3 (11113)”. The document addresses are “SP1, SP2, SP3, ...”, “SO1, SO2, SO3, ...”, “CO1, CO2, CO3, ...”, “PL1, PL2, PL3, ...”, “AR1, AR2, AR3, ...”, and “EV1, EV2, EV3, ...”. Here, “cherry 1”, “cherry 2”, and “cherry 3” denote the first meaning (11111), the second meaning (11112), and the third meaning (11113) of “cherry”, and “orange 1” and “orange 2” denote the first meaning (9911) and the second meaning (9912) of “orange”. For example, “orange 1” is the orange plant and “orange 2” is the color orange.
[0050]
When the classification model is updated, the update date and time is recorded in the classification model. In the figure, “Dec. 10, 1998 19:56:10” is recorded as the update date.
[0051]
The category of the classification model is manually created by the user by changing or deleting the classification display section corresponding to each category or setting a new classification display section in the document classification window 301.
[0052]
For example, a document is classified by using the mouse to drag the icon corresponding to the title of the document, displayed in a classification display unit of the document classification window 301, to the classification display unit corresponding to the desired category. The titles of the documents classified into each category are then displayed in the classification display unit corresponding to that category in the document classification window 301.
[0053]
In step S15, the control unit 11 of the document processing apparatus creates a classification model based on the categories created in step S14 and on the indexes of the documents that the user has manually classified into these categories. That is, the control unit 11 of the document processing apparatus collects the indexes of the plurality of documents classified into each category and generates the classification model.
[0054]
The category index of each category includes a proper noun characteristic for the category, a meaning other than the proper noun, and a document address classified into each category. Here, in the case of other than proper nouns, the meaning is used instead of the word itself because the same word may have a plurality of meanings. Then, the control unit 11 of the document processing apparatus stores the classification model created in this way in, for example, the RAM 14.
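The way a category index collects the indexes of its documents can be sketched as follows. This is a minimal illustration only: the class names, the function `build_classification_model`, and the category name "category 1" are hypothetical, and the sketch simply merges every proper noun and meaning of the classified documents, whereas the text speaks of entries that are characteristic of the category without specifying how they are chosen.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DocumentIndex:
    """Index of one document (hypothetical structure for illustration)."""
    address: str
    proper_nouns: Set[str] = field(default_factory=set)
    meanings: Set[int] = field(default_factory=set)  # meaning numbers, not words


@dataclass
class CategoryIndex:
    """Proper nouns, meanings, and document addresses of one category."""
    proper_nouns: Set[str] = field(default_factory=set)
    meanings: Set[int] = field(default_factory=set)
    addresses: List[str] = field(default_factory=list)

    def add(self, doc: DocumentIndex) -> None:
        # Assumption: every entry of the document index is merged in.
        self.proper_nouns |= doc.proper_nouns
        self.meanings |= doc.meanings
        self.addresses.append(doc.address)


def build_classification_model(
    assignments: Dict[str, List[DocumentIndex]]
) -> Dict[str, CategoryIndex]:
    """Collect the indexes of the documents assigned to each category."""
    model: Dict[str, CategoryIndex] = {}
    for category, docs in assignments.items():
        index = CategoryIndex()
        for doc in docs:
            index.add(doc)
        model[category] = index
    return model


model = build_classification_model({
    "category 1": [
        DocumentIndex("SP1", {"Mr. A"}, {4546}),  # baseball (4546)
        DocumentIndex("SP2", set(), {2343}),      # ground (2343)
    ],
})
```

Storing meaning numbers rather than words mirrors the reason given above: the same word may have several meanings.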
[0055]
The classification model can be created in step S15 each time a category is created in step S14 and a user's manual classification operation is performed.
[0056]
In step S16, the control unit 11 of the document processing apparatus registers the classification model created in step S15. The control unit 11 stores the registered classification model in, for example, the RAM 14.
[0057]
(2) Automatic document classification
Next, automatic document classification performed by the document processing apparatus based on the classification model will be described with reference to the corresponding flowchart. This classification is applied to documents received after the classification model has been created by the manual classification processing described above. In this example, it is assumed that the processing shown in FIG. 8 is performed every time one document is received. However, the processing may instead be performed every time a predetermined number of documents have been received, or it may be performed on all documents received so far when the user performs an operation of opening the screen.
[0058]
In step S21, the receiving unit 21 of the document processing apparatus receives a document from the outside. Since the reception of this document has been described in step S11, the description thereof will be omitted here.
[0059]
In step S22, the control unit 11 of the document processing apparatus reads the document stored in the RAM 14 in step S21 and creates an index. The creation of this index will be further described later.
[0060]
In step S23, the control unit 11 of the document processing device automatically classifies each indexed document into any category of the classification model based on the classification model. Then, the control unit 11 stores the classification result in, for example, the RAM 14. Details of the automatic classification will be described later.
[0061]
In step S24, the control unit 11 of the document processing apparatus updates the classification model based on, for example, the result of automatic classification of a new document stored in the RAM 14 in step S23. In step S25, the control unit 11 of the document processing apparatus registers the classification model updated in step S24. The control unit 11 stores the registered classification model in, for example, the RAM 14.
[0062]
Next, index creation in step S12 of FIG. 4 and step S22 of FIG. 8 will be described with reference to FIG.
[0063]
In step S31, the control unit 11 of the document processing apparatus executes active diffusion, which diffuses the central activation values of the elements based on the internal structure of the document received in step S11 of FIG. 4 or step S21 of FIG. 8. The active diffusion process will be described later. The control unit 11 stores the central activation value of each element obtained as a result of active diffusion in, for example, the RAM 14.
[0064]
In step S32, the control unit 11 of the document processing device extracts elements whose center activity value exceeds a preset threshold based on the center activity value of each element obtained in step S31. The control unit 11 stores the extracted elements in, for example, the RAM 14.
[0065]
In step S33, the control unit 11 of the document processing apparatus reads the elements extracted in step S32 from, for example, the RAM 14. The control unit 11 then extracts all proper nouns from these elements and adds them to the index. Proper nouns are handled separately from other words because they have special properties: they have no word meaning and are not listed in dictionaries. Here, a word meaning is one of the plurality of meanings of a word.
[0066]
The control unit 11 of the document processing apparatus determines whether or not an element is a proper noun based on the tags attached to the received document. For example, in the tagged internal structure shown in FIG. 3, “Mr. A”, “Group B”, and “C City” have the relation attributes “person name”, “organization name”, and “place name”, respectively, so they can be seen to be proper nouns. The control unit 11 then adds the extracted proper nouns to the index and stores the result in, for example, the RAM 14.
[0067]
In step S34, the control unit 11 of the document processing apparatus extracts the meanings of the words other than proper nouns from the elements extracted in step S32 and read from, for example, the RAM 14, adds them to the index, and stores the result in the RAM 14.
[0068]
As described above, the index is created by finding the features of a tagged document and arranging those features in an index. The features of the document are determined from the central activation values that have been diffused according to the internal structure of the document.
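Steps S32 to S34 can be illustrated with the following sketch. It assumes the central activation values are already available after diffusion; the `Element` tuple, the function `create_index`, and the dictionary layout of the result are hypothetical names chosen for this illustration and are not taken from the specification.

```python
from typing import Dict, List, NamedTuple, Optional


class Element(NamedTuple):
    text: str
    activation: float        # central activation value after active diffusion
    is_proper_noun: bool     # taken from the tag (person, organization, place name)
    meaning: Optional[int]   # meaning number; None for proper nouns


def create_index(elements: List[Element], threshold: float, address: str) -> Dict:
    # S32: keep only elements whose central activation value exceeds the threshold.
    selected = [e for e in elements if e.activation > threshold]
    # S33: proper nouns are added to the index as words.
    proper_nouns = sorted({e.text for e in selected if e.is_proper_noun})
    # S34: for other words, the meaning number is added instead of the word itself.
    meanings = sorted(
        {e.meaning for e in selected if not e.is_proper_noun and e.meaning is not None}
    )
    return {"address": address, "proper_nouns": proper_nouns, "meanings": meanings}


index = create_index(
    [
        Element("Mr. A", 200.0, True, None),
        Element("showed", 176.8, False, 7289),
        Element("the", 1.0, False, 12),   # below the threshold, so dropped
    ],
    threshold=100.0,
    address="SP1",
)
```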
[0069]
Note that the above-described index includes a document address indicating a position where the document is stored in the RAM 14 together with a meaning and proper noun representing the document characteristics.
[0070]
Since the index includes meanings and proper nouns representing features that represent the document, it can be used when referring to a desired document.
[0071]
Next, active diffusion for diffusing the central active value corresponding to the element based on the internal structure of the document will be described with reference to FIG. The active diffusion is performed in step S31 of FIG. The active diffusion is a process for giving a high central activity value to an element related to an element having a high central activity value. Since this central activity value is determined according to the internal structure by tagging, it is used for extracting document features and the like.
[0072]
In step S81, the control unit 11 of the document processing apparatus sets the end point activation value of the end point of the link connecting the elements to 0 for the reference / referenced link and the normal link. The control unit 11 stores the initial value of the end point activation value thus assigned, for example, in the RAM 14.
[0073]
The connection between the elements is as shown in FIG. 11, for example. This figure shows element E_i and element E_j as part of the structure of elements and links that make up the document. Elements E_i and E_j have central activation values e_i and e_j, respectively, and are connected by link L_ij. The end point of link L_ij connected to element E_i is T_ij, and the end point connected to element E_j is T_ji. In addition to element E_j, to which it is connected by link L_ij, element E_i is connected to elements E_k, E_l, and E_m (not shown) by links L_ik, L_il, and L_im, respectively. In addition to element E_i, to which it is connected by link L_ij (L_ji as seen from element E_j), element E_j is connected to elements E_p, E_q, and E_r (not shown) by links L_jp, L_jq, and L_jr, respectively.
[0074]
In step S82, the control unit 11 of the document processing apparatus initializes the counter that counts the elements E_i constituting the document. That is, the count value i of the counter for counting elements is set to 1, so that the counter refers to the first element E₁.
[0075]
In step S83, the control unit 11 of the document processing apparatus executes a link process for calculating a new center activation value for the element referred to by the counter. This link process will be further described later.
[0076]
In step S84, the control unit 11 of the document processing apparatus determines whether or not the calculation of new center activation values has been completed for all elements in the document. Then, when the calculation of the central activation value for all elements in the document is completed, the control unit 11 proceeds to step S85 as “YES”, and the calculation of the new central activation value for all elements in the document is completed. If not, "NO" is determined and the process proceeds to step S87.
[0077]
Specifically, the control unit 11 determines whether or not the count value i of the counter has reached the total number of elements included in the document. Then, when the count value i of the counter reaches the total number of elements included in the document, the control unit 11 determines that all elements have been calculated and proceeds to step S85. When the count value i of the counter does not reach the total number of elements included in the document, the control unit 11 determines that calculation has not been completed for all elements and proceeds to step S87.
[0078]
In step S87, the control unit 11 of the document processing apparatus increments the count value i of the counter by 1, setting it to i + 1. As a result, the counter refers to the next, (i + 1)-th element E_{i + 1}. The process then returns to step S83, and the calculation of the end point activation values and the subsequent series of steps are executed for this element E_{i + 1}.
[0079]
In step S85, the control unit 11 of the document processing apparatus calculates the average, over all elements included in the document, of the change in the central activation value, that is, the change of the newly calculated central activation value relative to the original central activation value.
[0080]
The control unit 11 of the document processing apparatus reads, for all elements included in the document, the original central activation values stored in, for example, the RAM 14 and the newly calculated central activation values. The control unit 11 divides the sum of the changes of the newly calculated central activation values relative to the original central activation values by the total number of elements included in the document, thereby calculating the average change in the central activation value over all elements. The control unit 11 stores the average change calculated in this way in, for example, the RAM 14.
[0081]
In step S86, the control unit 11 determines whether or not the average change in the central activation value of all the elements calculated in step S85 is within a preset threshold. When the change is within the threshold, the control unit 11 determines “YES” and ends this series of steps. When the change is not within the threshold, the control unit 11 determines “NO”, returns to step S82 to set the count value i of the counter to 1, and again executes the series of steps for calculating the central activation values of the elements of the document. Each time this loop is repeated, the amount of change gradually decreases.
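The control flow of steps S82 to S87 can be sketched as follows. The per-element calculation of step S83 is passed in as a caller-supplied function `update_element`, which is a hypothetical name. The use of the absolute value for the change, the use of already updated values within a pass, and the `max_rounds` safety limit are assumptions of this sketch rather than details stated in the text.

```python
from typing import Callable, List


def diffuse(
    values: List[float],
    update_element: Callable[[List[float], int], float],
    threshold: float,
    max_rounds: int = 1000,   # assumption: safety limit, not in the description
) -> List[float]:
    """Repeat the element loop until the average change is within the threshold."""
    values = list(values)
    if not values:
        return values
    for _ in range(max_rounds):
        new_values = list(values)
        for i in range(len(new_values)):                  # S82, S84, S87: visit each element
            new_values[i] = update_element(new_values, i)  # S83: link process
        # S85: average change of the central activation value over all elements.
        mean_change = sum(abs(n - o) for n, o in zip(new_values, values)) / len(values)
        values = new_values
        if mean_change <= threshold:                      # S86: converged
            break
    return values


# Toy update rule (not the link process of the specification): move halfway to 10.
result = diffuse([0.0, 4.0], lambda v, i: v[i] + (10.0 - v[i]) / 2, threshold=0.01)
```

With the toy rule the change halves on every pass, which shows the stopping test at work; the actual link process is described in the paragraphs that follow.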
[0082]
Next, the link process executed in step S83 in FIG. 10 will be described with reference to the corresponding flowchart. The description here takes one element E_i as an example, but in the central activation value diffusion process the link process is performed for all elements.
[0083]
In step S51, the control unit 11 of the document processing apparatus initializes the counter that counts the links having one end connected to element E_i constituting the document. That is, the count value j of the counter for counting links is set to 1, so that the counter refers to the first link L_i1 connected to element E_i.
[0084]
In step S52, the control unit 11 of the document processing apparatus refers to the relation attribute tag of the link L_ij connecting elements E_i and E_j and determines whether or not the link is a normal link. When link L_ij is a normal link, the control unit 11 determines “YES” and proceeds to step S53. When link L_ij is a reference link, it determines “NO” and proceeds to step S54.
[0085]
In step S53, the control unit 11 of the document processing apparatus calculates a new end point activation value for the end point T_ij at which element E_i is connected to the normal link L_ij.
[0086]
Here, link L_ij has been determined in step S52 to be a normal link. The new end point activation value t_ij of the end point T_ij at which element E_i is connected to the normal link L_ij is obtained as follows. The end point activation values t_jp, t_jq, and t_jr of all end points T_jp, T_jq, and T_jr of element E_j that are connected to links other than L_ij are added to the central activation value e_j of element E_j, to which element E_i is connected by link L_ij, and the resulting sum is divided by the total number of elements included in the document.
[0087]
The control unit 11 of the document processing apparatus reads the end point activation values and the central activation value from, for example, the RAM 14. From the values read, the control unit 11 calculates, as described above, a new end point activation value for the end point connected to the normal link. The control unit 11 then stores the end point activation value calculated in this way in, for example, the RAM 14.
[0088]
In step S54, the control unit 11 of the document processing apparatus calculates a new end point activation value for the end point T_ij at which element E_i is connected to the reference link L_ij.
[0089]
Here, link L_ij has been determined in step S52 to be a reference link. The new end point activation value t_ij of the end point T_ij at which element E_i is connected to the reference link L_ij is obtained by adding the end point activation values t_jp, t_jq, and t_jr of all end points T_jp, T_jq, and T_jr of element E_j that are connected to links other than L_ij to the central activation value e_j of element E_j, to which element E_i is connected by link L_ij.
[0090]
The control unit 11 of the document processing apparatus reads the necessary end point activation values and central activation value from, for example, the end point activation values and central activation values stored in the RAM 14. Using the values read, the control unit 11 calculates, as described above, a new end point activation value for the end point connected to the reference link. The control unit 11 then stores the end point activation value calculated in this way in, for example, the RAM 14.
[0091]
The normal link process in step S53 and the reference link process in step S54 lie inside the loop from step S52 to step S55 and are executed for all links L_ij connected to the element E_i referred to by the count value i.
[0092]
In step S55, the control unit 11 of the document processing apparatus determines whether or not end point activation values have been calculated for all links connected to element E_i. If end point activation values have been calculated for all links, the control unit 11 determines “YES” and proceeds to step S56. If they have not been calculated for all links, it determines “NO” and proceeds to step S57.
[0093]
In step S56, since the end point activation values t_ij have been calculated for all links L_ij of element E_i as determined in step S55, the control unit 11 of the document processing apparatus updates the central activation value e_i of element E_i.
[0094]
The new, updated value e_i′ of the central activation value of element E_i is obtained by adding the sum of the new end point activation values of all end points of element E_i to the current central activation value e_i of element E_i, that is, e_i′ = e_i + Σt_j′. Here, the prime “′” denotes a new value.
[0095]
The control unit 11 of the document processing apparatus reads the necessary end point activation values from, for example, the end point activation values and central activation values stored in the RAM 14. The control unit 11 executes the calculation described above to obtain the new central activation value of element E_i, and then stores the calculated central activation value in, for example, the RAM 14.
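The link process for a single element (steps S51 to S56) can be sketched as below. The data layout is an assumption made for this illustration: `links[i]` lists the neighbours of element i together with the link type, and `t[(i, j)]` holds the end point activation value of end point T_ij. The function name `link_process` and the assumption of at most one link per pair of elements are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def link_process(
    i: int,
    e: Dict[int, float],                    # central activation values e_i
    t: Dict[Tuple[int, int], float],        # end point activation values t_ij
    links: Dict[int, List[Tuple[int, str]]],  # neighbours and link type per element
    n_elements: int,
) -> float:
    """Update the end points of element i and return its new central value."""
    total = 0.0
    for j, kind in links[i]:                          # S51, S55, S57: each link L_ij
        # End points of E_j on links other than L_ij, plus the central value e_j.
        others = sum(t[(j, p)] for p, _ in links[j] if p != i)
        value = others + e[j]
        if kind == "normal":                          # S52 -> S53: normal link
            value /= n_elements                       # divided by the number of elements
        t[(i, j)] = value                             # S53 / S54: new t_ij
        total += value
    return e[i] + total                               # S56: e_i' = e_i + sum of new t


# Three elements: 0 and 1 joined by a normal link, 1 and 2 by a reference link.
links = {0: [(1, "normal")], 1: [(0, "normal"), (2, "reference")], 2: [(1, "reference")]}
e = {0: 1.0, 1: 2.0, 2: 3.0}
t = {(0, 1): 0.0, (1, 0): 0.0, (1, 2): 0.0, (2, 1): 0.0}   # S81: initial values 0
new_e1 = link_process(1, e, t, links, n_elements=3)
```

In this example the reference link passes on the full value 3.0 while the normal link passes on 1.0 divided by 3, so referenced elements contribute more strongly.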
[0096]
Next, automatic classification in step S23 of FIG. 8 will be described with reference to FIG.
[0097]
In step S71, for a category C_i of the classification model, the control unit 11 of the document processing apparatus counts the proper nouns common to the set of proper nouns belonging to category C_i and the set of proper nouns among the words extracted from the document received in step S21 and entered in its index, and denotes this number by P(C_i). The control unit 11 then stores the calculated number P(C_i) in, for example, the RAM 14.
[0098]
In step S72, the control unit 11 of the document processing apparatus looks up, in the table of degrees of association between meanings shown in the figure, the degree of association between each meaning included in the index of the document and each meaning included in the category index of category C_i, and calculates the sum R(C_i) of these degrees of association. That is, for the words other than proper nouns, the control unit 11 calculates the sum R(C_i) of the degrees of association between the meanings of the document and the meanings in the classification model. The control unit 11 then stores the calculated sum R(C_i) in, for example, the RAM 14.
[0099]
In step S73, the control unit 11 of the document processing apparatus defines the degree of association between document classifications for category C_i as
Rel(C_i) = m₁P(C_i) + n₁R(C_i)
where the coefficients m₁ and n₁ are constants that represent the degree to which each term contributes to the degree of association between document classifications. The control unit 11 reads the number P(C_i) of common proper nouns calculated in step S71 and the sum R(C_i) of the degrees of association between meanings calculated in step S72 from, for example, the RAM 14, and applies them to the above formula to calculate the degree of association between document classifications Rel(C_i). The coefficients can be set to, for example, m₁ = 10 and n₁ = 1. The control unit 11 then stores the calculated Rel(C_i) in, for example, the RAM 14.
[0100]
The values of the coefficients m₁ and n₁ can also be estimated using statistical techniques. That is, the control unit 11 can obtain the coefficients used in the degree of association between document classifications Rel(C_i) by optimization.
[0101]
In step S74, the control unit 11 of the document processing apparatus classifies the document into category C_i when the degree of association between document classifications Rel(C_i) for category C_i is the largest among the categories and its value exceeds a certain threshold. That is, the control unit 11 calculates the degree of association between document classifications for each of the plurality of categories and, when the maximum degree exceeds the threshold, classifies the document into the category C_i having that maximum degree. When the maximum degree of association between document classifications does not exceed the threshold, the document is not classified.
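Steps S71 to S74 can be put together in a short sketch. The function name `classify`, the dictionary layout of the model, the use of words instead of meaning numbers as keys, and the treatment of pairs missing from the table as having a degree of association of 0 are all assumptions of this illustration. The coefficient values m₁ = 10 and n₁ = 1 and the degrees 0.55 and 0.25 are the example values given in the text.

```python
from typing import Dict, Optional, Set, Tuple


def classify(
    doc_proper: Set[str],
    doc_meanings: Set[str],
    model: Dict[str, Dict[str, Set[str]]],
    relevance: Dict[Tuple[str, str], float],
    m1: float = 10.0,
    n1: float = 1.0,
    threshold: float = 0.0,
) -> Optional[str]:
    """Return the best category, or None when no Rel(C_i) exceeds the threshold."""
    best, best_rel = None, float("-inf")
    for name, cat in model.items():
        p = len(doc_proper & cat["proper_nouns"])              # S71: P(C_i)
        r = sum(                                               # S72: R(C_i)
            relevance.get((a, b), 0.0)                         # missing pair -> 0 (assumption)
            for a in doc_meanings
            for b in cat["meanings"]
        )
        rel = m1 * p + n1 * r                                  # S73: Rel(C_i)
        if rel > best_rel:
            best, best_rel = name, rel
    return best if best_rel > threshold else None              # S74


model = {
    "category A": {"proper_nouns": {"C company"}, "meanings": {"television", "VTR"}},
    "category B": {"proper_nouns": {"Mr. A"}, "meanings": set()},
}
table = {("computer", "television"): 0.55, ("computer", "VTR"): 0.25}
chosen = classify({"C company"}, {"computer"}, model, table, threshold=5.0)
rejected = classify({"C company"}, {"computer"}, model, table, threshold=20.0)
```

Here Rel for category A is 10 × 1 + 1 × (0.55 + 0.25) = 10.8, so the document is classified only when the threshold is below that value.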
[0102]
Next, the calculation of the degree of association between meanings used in step S72 of FIG. 13 will be described with reference to FIG. The process shown in FIG. 14 need only be performed once before the process shown in FIG.
[0103]
In step S61, the control unit 11 of the document processing apparatus creates a meaning network from the explanations of word meanings in an electronic dictionary. That is, the meaning network is created from the reference relationships between each meaning in the dictionary and the meanings that appear in its explanation. As a result, a tree-like meaning network with the dictionary as the highest vertex is constructed. The internal structure of the network is described by tagging as described above. The control unit 11 of the document processing apparatus creates the network by sequentially reading the meanings and explanations of the electronic dictionary stored in, for example, the RAM 14. The control unit 11 stores the meaning network created in this way in, for example, the RAM 14.
[0104]
The network is created by the control unit 11 of the document processing apparatus using a dictionary, but it can also be obtained by being received from the outside by the receiving unit 21 or by being reproduced from the recording medium 32 by the recording / reproducing unit 31. The dictionary is likewise obtained by being received from the outside by the receiving unit 21 or by being reproduced from the recording medium 32 by the recording / reproducing unit 31.
[0105]
In step S62, the central activation value corresponding to each semantic element is diffused on the semantic network created in step S61. By this active diffusion, the central activity value corresponding to each meaning is given according to the internal structure by tagging given by the dictionary. The center activity value diffusion process will be described later.
[0106]
In step S63, one meaning s_i constituting the meaning network created in step S61 is selected. In step S64, the initial value of the central activation value e_i of the vocabulary element E_i corresponding to this meaning s_i is changed, and the resulting difference Δe_i of the central activation value is calculated.
[0107]
In step S65, the difference Δe_j of the central activation value e_j of the element E_j corresponding to another meaning s_j, produced in response to the difference Δe_i of the central activation value e_i of element E_i in step S64, is obtained. In step S66, the quotient Δe_j / Δe_i, obtained by dividing the difference Δe_j obtained in step S65 by the difference Δe_i obtained in step S64, is taken as the degree of association between meanings of meaning s_i with respect to meaning s_j.
[0108]
In step S67, it is determined whether or not the degree of association between meanings has been calculated for all pairs of one meaning s_i and another meaning s_j. When the calculation has been completed for all pairs of meanings, “YES” is determined and this processing ends. When the calculation has not been completed for all pairs, “NO” is determined, the processing returns to step S63, and the calculation continues for the pairs for which the degree of association between meanings has not yet been calculated.
[0109]
In the loop from step S63 to step S67, the control unit 11 of the document processing apparatus sequentially reads the necessary values from, for example, the RAM 14 and calculates the degree of association between meanings as described above. The control unit 11 stores the calculated degrees of association between meanings one after another in, for example, the RAM 14.
[0110]
The degree of association between meanings calculated in this way is defined between each pair of meanings, as shown in the table in the figure. In this table, the degree of association between meanings is normalized so as to take a value from 0 to 1. The table shows the degrees of association among “computer”, “television”, and “VTR”. The degree of association between the meanings of “computer” and “television” is 0.55, that between “computer” and “VTR” is 0.25, and that between “television” and “VTR” is 0.60.
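The quotient Δe_j / Δe_i of steps S63 to S66 can be illustrated as follows. The diffusion itself is supplied by the caller as `diffuse`, a hypothetical function, and the toy linear diffusion in the usage example is not the active diffusion of the specification. Measuring both differences after diffusion and the size `delta` of the change are assumptions, and normalization to the range 0 to 1 is not shown.

```python
from typing import Callable, List


def meaning_relevance(
    diffuse: Callable[[List[float]], List[float]],
    initial: List[float],
    i: int,
    j: int,
    delta: float = 1.0,      # assumption: size of the change applied in S64
) -> float:
    """Degree of association of meaning s_i with respect to meaning s_j."""
    base = diffuse(list(initial))             # S62: values without the change
    changed = list(initial)
    changed[i] += delta                       # S64: change the initial value of e_i
    after = diffuse(changed)
    d_i = after[i] - base[i]                  # difference Δe_i
    d_j = after[j] - base[j]                  # S65: difference Δe_j
    if d_i == 0:
        raise ValueError("changing e_i had no effect, so the quotient is undefined")
    return d_j / d_i                          # S66: quotient Δe_j / Δe_i


def toy_diffuse(v: List[float]) -> List[float]:
    # Toy stand-in: each of two elements receives half of the other's value.
    return [v[0] + 0.5 * v[1], v[1] + 0.5 * v[0]]


degree = meaning_relevance(toy_diffuse, [1.0, 1.0], i=0, j=1)
```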
[0111]
(3) Actual interest level and predicted interest level
Next, details of step S13 in FIG. 4 will be described with reference to FIG. By performing this processing, the actual interest level is detected.
[0112]
In step S101, the user selects a desired document from the document classification window 301 shown in FIG. For example, the user selects an icon corresponding to the title of the document displayed in the classification display section of the document classification window 301 with the mouse of the input section 20. Then, by selecting the “browser” button of the operation button 302, the process proceeds to the display step of the next step S102.
[0113]
In step S102, the control unit 11 of the document processing apparatus reads the document selected by the user in step S101 from the RAM 14, for example. In the display unit 30, the control unit 11 displays the read document on the document display unit 53 of the window 51. As described above, when the entire document cannot be displayed on the document display unit 53 of the window 51, a part of the document is displayed.
[0114]
In step S103, the user reads the document displayed on the document display unit 53 of the window 51 in step S102 and has a summary created. That is, the user reads the document on the document display unit 53 of the window 51 displayed in step S102. In addition, the user selects the “summarize” button among the operation buttons 56 of the window 51 to display the summary of the document displayed on the document display unit 53 on the summary display unit 54.
[0115]
Here, the procedure by which, when a summary is created and displayed on the summary display unit 54, the importance of an element selected by the user's operation in the document displayed on the document display unit 53 is increased will be described with reference to the flowchart.
[0116]
In the first step S91, the control unit 11 determines whether or not an element in the document has been selected by the user. This determination is made by selection using a graphic user interface (GUI) shown in FIG. 18 that accepts user input.
[0117]
The window 51 includes a file name display unit 52 for displaying the file name of the document, a document display unit 53 for displaying the document whose file name is displayed on the file name display unit 52, and a summary display unit 54 for displaying the summary of the document displayed on the document display unit 53. The document display unit 53 displays all or part of the document whose file name or first part is displayed in the file name display unit 52. When only a part of the document is displayed on the document display unit 53, the entire document can be browsed sequentially by, for example, scrolling the document displayed on the document display unit 53. The summary display unit 54 displays, by processing to be described later, a summary of the document displayed on the document display unit 53 that corresponds to the size of the summary display unit 54. The summary display unit 54 is blank because no summary has been created yet. Note that the sizes of the document display unit 53 and the summary display unit 54 can be changed. The document handled in the window 51 is, for example, received by the receiving unit 21 of the document processing apparatus and recorded in the recording / reproducing unit 31 or the RAM 14.
[0118]
The window 51 includes a keyword input unit 55 for inputting a keyword and a button unit 56 having a plurality of buttons. When a keyword is input into the keyword input unit 55, the importance of those words displayed on the document display unit 53 that have a high degree of association with the keyword is increased. The button unit 56 includes an “Undo” button for reverting the result of an operation and a “summarize” button for executing a process of summarizing the text displayed on the document display unit 53 and displaying it on the summary display unit 54. When the “summarize” button is selected, for example after the size of the summary display unit 54 has been changed, a summary of the document displayed on the document display unit 53 is generated to match the new size of the summary display unit 54, and the generated summary is displayed on the summary display unit 54.
[0119]
In step S91 in FIG. 17, the control unit 11 determines whether or not an element in the text displayed on the document display unit 53 has been selected by the user in the window 51 displayed on the display unit 30 of the document processing apparatus. An element in the document display unit 53 can be selected through the input unit 20 of the document processing apparatus by using a pointing device to operate a cursor that is displayed on the display unit 30 and linked to the pointing device. For example, when a mouse is used as the pointing device, the mouse is operated to move the cursor to a desired element in the document display unit 53, and the element is selected by clicking with the mouse. When an element is selected in the document display unit 53, the selected element is, for example, highlighted so that it can be clearly identified. In FIG. 19, the vocabulary element “mainframe” 57, which is the smallest selectable element, is highlighted in the document display unit 53 of the window 51. The summary display unit 54 is blank because no summary has been created yet. When an element is selected in this way, the control unit 11 determines “YES” and proceeds to the next step S92. When no element is selected, for example when there is no input within a predetermined time or when a portion of the document display unit 53 other than the displayed text is clicked with the mouse, the control unit 11 determines “NO”, returns the process to step S91, and waits for an element to be input. In the following description, for convenience, a mouse is assumed to be used as the pointing device of the input unit 20.
[0120]
In step S92, the control unit 11 of the document processing apparatus determines whether the element selected in step S91 is one that has already been selected by clicking with the mouse in the past. When the element has been selected by clicking with the mouse in the past, the control unit 11 determines “YES” and advances the process to step S93. When it has not, the control unit 11 determines “NO” and advances the process to step S94.
[0121]
In step S93, the control unit 11 of the document processing apparatus determines whether the level of the selected element is that of a text element. When the level is a text element, the control unit 11 determines “YES” and returns the process to step S91. When the level is not a text element, the control unit 11 determines “NO” and proceeds to step S95.
[0122]
In step S94, the control unit 11 of the document processing apparatus sets the level to the vocabulary element, which is the smallest element of the document and the lowest element of the internal structure given to the document by tagging. The control unit 11 then returns the process to step S91.
[0123]
In step S95, the control unit 11 of the document processing apparatus raises the level by one. For example, when the level is raised by one from the vocabulary element “mainframe” 57 selected in step S91, the next higher element containing this vocabulary element, “Big mainframe computers” 59, is selected and highlighted, as shown in FIG. 20. At the same time, the control unit 11 increases the weight of the selected higher element, that is, its central activity value, above that of elements that are not selected. The control unit 11 then returns the process to step S91.
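The level-raising behaviour of steps S91 to S95 can be sketched as follows. This is a minimal illustration only: the `Element` and `Selector` names, the `boost` amount, and the rule that a click counts as a repeated selection when the clicked word lies inside the currently highlighted element are assumptions, not details given in the text. The weight is raised only in the branch corresponding to step S95, as the description states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Element:
    """Hypothetical node of the tagged document tree."""
    label: str
    parent: Optional["Element"] = None
    weight: float = 1.0  # stands in for the central activity value


def is_within(node: Optional[Element], ancestor: Element) -> bool:
    """Return True if node is ancestor or lies below it in the tree."""
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False


class Selector:
    def __init__(self, boost: float = 1.0) -> None:
        self.boost = boost  # assumed increment; the text gives no value
        self.current: Optional[Element] = None  # element highlighted now

    def click(self, word: Element) -> Element:
        """Handle one click on a vocabulary element."""
        cur = self.current
        if cur is None or not is_within(word, cur):
            # Not selected before: start at the lowest level (step S94).
            self.current = word
        elif cur.parent is not None:
            # Selected before and not yet at the top: go up one level
            # and raise the weight of the higher element (step S95).
            self.current = cur.parent
            self.current.weight += self.boost
        # Otherwise the top level is already selected: nothing changes.
        return self.current
```

Clicking the same word repeatedly therefore widens the highlighted span one structural level at a time, from the word to its phrase, sentence, and so on.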
[0124]
When the “summarize” button of the button unit 56 of the window 51 is selected by clicking with the mouse, a summary of the text displayed on the document display unit 53 is displayed on the summary display unit 54. When the “summarize” button is selected, the control unit 11 exits the series of steps shown in FIG. 17 by an interrupt and starts the process of creating a summary. The summary is generated from the document displayed on the document display unit 53 so as to fill the area of the summary display unit 54 according to its size. As shown in FIG. 21, the summary displayed on the summary display unit 54 includes an element “Big mainframe computers” 60 corresponding to the element “Big mainframe computers” 59 highlighted on the document display unit 53. Thus, by selecting a desired element in the document display unit 53 of the window 51 and increasing its importance, the likelihood that the element is included in the summary can be increased. Details of the summary generation will be described later.
[0125]
In the window 51 shown in FIG. 18, elements in the document displayed on the document display unit 53 can be selected not only by clicking with the mouse but also by inputting a keyword into the keyword input unit 55. The control unit 11 increases the importance of the elements related to the keyword input into the keyword input unit 55 in this way. The degree of association between the keyword and an element is obtained, for example, by referring to a table recorded in the ROM 15. This reference is made for the elements that contain the keyword, as identified by the tagging.
[0126]
In step S104 of FIG. 16, the control unit 11 of the document processing apparatus calculates the user's actual interest level in the document. The actual interest level is calculated based on the operations the user performed in step S103 on the document displayed in the window 51.
[0127]
Here, the actual interest level and the predicted interest level used in the present embodiment will be described. The actual interest level is calculated in step S104 and is the interest level, detected from the user's operations, in a document that the user has actually operated on. The predicted interest level, on the other hand, is a prediction of the user's interest level in a document. The predicted interest level is predicted based on, for example, the actual interest level.
[0128]
In step S105, the control unit 11 records the user operation history in the index. In the specific example of the index described above, the user operation history was exemplified as
<User operation history Maximum summary size="100">
<Selected Number of elements="10">Picturetel</Selected>
...
</User operation history>
In step S105, the control unit 11 updates the operation history, such as the maximum size of the summary, the selected elements, and the number of selected elements. The control unit 11 stores the updated index in, for example, the RAM 14.
[0129]
The index can include the actual interest level of the document. For example, the actual interest level for each document may be included in the index for each category. In such a case, in step S105, the actual interest level itself included in the index related to the document is also updated.
[0130]
Next, the user operation in step S103 of FIG. 16 will be described with reference to FIG. 22, FIG. 23, FIG. 23, and FIG.
[0131]
The document whose title is displayed in the document classification window 301 can be displayed on the display unit 30 by selecting on the display unit 30 using the mouse of the input unit 20, for example. A specific example of the document display window for displaying the document in this way is shown in FIG. 18, and the description thereof is omitted here.
[0132]
Next, an example of the process of creating a summary, with more detailed control than that shown in FIG. 4, will be described in detail with reference to the flowchart shown in FIG. 22. This series of steps is started by turning on the “Summary” button 103.
[0133]
The process of creating a summary from a document is performed based on the internal structure given to the document by tagging. As described above, the size of the display area 130 for displaying the summary in the window 100 can be changed. When the window 101 is newly drawn in the window 100 of the display unit 30, or when the size of the display area 130 is changed and the execution button 103 is operated, the control unit 11 of the document processing apparatus creates a summary, fitted to the display area 130, from the document displayed in the display area 120 of the window 100.
[0134]
In the first step S120 in FIG. 22, the control unit 11 of the document processing apparatus performs active diffusion. In the present embodiment, documents are summarized by using the central activity value obtained by active diffusion as the importance. That is, in a document given an internal structure by tagging, performing the process called active diffusion gives each element a central activity value that reflects that internal structure. Active diffusion is a process that gives a high central activity value to elements related to an element having a high central activity value. More precisely, it is an operation on central activity values in which the value is kept equal between an anaphoric (coreferential) expression and its antecedent and is otherwise attenuated. Because the central activity value is determined according to the internal structure given by tagging, it can be used to analyze documents in a way that takes this internal structure into account.
[0135]
In step S121, the control unit 11 of the document processing apparatus sets the size of the summary display unit 54 of the window 51 displayed on the display unit 30, specifically the maximum number of characters that can be displayed on the summary display unit 54, as w_s. The control unit 11 also initializes the character string s that holds the summary, setting the initial value s_0 = "". The control unit 11 records the maximum number of characters w_s and the initial value s_0 set in this way in, for example, the RAM 14.
[0136]
In step S122, the control unit 11 of the document processing apparatus sets the count value i of the counter that counts the sequential creation of the summary skeleton to zero. That is, the control unit 11 sets i = 0 for the count value. The control unit 11 records the count value i set in this way in, for example, the RAM 14.
[0137]
In step S123, the control unit 11 of the document processing apparatus extracts, for the count value i of the counter, the skeleton of the sentence having the i-th highest average central activity value in the document. The average central activity value is the average of the central activity values of the elements constituting one sentence. The control unit 11 reads the summary s_{i-1} recorded in, for example, the RAM 14 and adds the extracted sentence skeleton to s_{i-1} to obtain s_i. The control unit 11 records the s_i thus obtained in, for example, the RAM 14. At the same time, the control unit 11 creates a list l_i of the elements not included in the sentence skeleton, ordered by central activity value, and records the list l_i in, for example, the RAM 14.
[0138]
That is, in step S123 the summarization algorithm uses the result of active diffusion to select sentences in descending order of average central activity value and extracts the skeleton of each selected sentence. The skeleton of a sentence consists of the essential elements extracted from the sentence. The essential elements are the head of an element; elements having a relational attribute of subject, object, indirect object, possessor, cause, condition, or comparison; and, where a coordinate structure is itself an essential element, the elements directly contained in it. The skeleton of the sentence is generated by connecting the essential elements of the sentence and is added to the summary.
[0139]
In step S124, the control unit 11 of the document processing apparatus determines whether the length of s_i is larger than the maximum number of characters w_s of the summary display unit 54 of the window 51. When the length of s_i is larger than w_s, the control unit 11 determines “YES” and ends this series of processes through step S129. When it is not larger, the control unit 11 determines “NO” and advances the process to step S125. In other words, the process ends once the summary reaches the designated amount. If there is still room, the central activity value of the sentence with the next highest average central activity value is compared with the central activity value of the omitted elements, and the higher of the two is added to the summary.
[0140]
In step S129, because the length of s_i was determined in step S124 to be larger than the maximum number of characters w_s, the control unit 11 of the document processing apparatus sets the summary to s_{i-1}. When not even the first sentence skeleton fits in the window, s_i = s_0 = "" is output, and no summary is displayed. The control unit 11 then ends this series of processes.
[0141]
In step S125, the control unit 11 of the document processing apparatus compares the central activity value of the sentence having the (i+1)-th highest average central activity value with the central activity value of the element having the highest central activity value in the list l_i created in step S123. When the central activity value of that sentence is higher than that of the highest-ranked element in the list l_i, the control unit 11 determines “YES” and advances the process to step S127. When it is not higher, the control unit 11 determines “NO” and advances the process to step S126.
[0142]
In step S126, the control unit 11 of the document processing apparatus increases the count value i of the counter by 1. The control unit 11 then returns the process to step S123.
[0143]
In step S127, the control unit 11 of the document processing apparatus adds the element e having the highest central activity value in the list l_i to s_i to generate ss_i, and deletes the element e from the list l_i. The control unit 11 records the ss_i generated in this way in, for example, the RAM 14.
[0144]
In step S128, the control unit 11 of the document processing apparatus determines whether the length of ss_i is larger than the maximum number of characters w_s of the summary display unit 54 of the window 51. When the length of ss_i is larger than w_s, the control unit 11 determines “YES” and ends this series of steps through step S130. When it is not larger, the control unit 11 determines “NO” and returns the process to step S125.
[0145]
In step S130, because the length of ss_i was determined in step S128 to be larger than the maximum number of characters w_s, the control unit 11 of the document processing apparatus sets the summary to s_i. As a result, the summary is generated so as not to exceed the maximum number of characters w_s. The control unit 11 then ends this series of processes.
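The loop of steps S121 to S130 can be sketched roughly as below. This is a simplified illustration under several assumptions rather than the apparatus's actual procedure: each sentence is given as a hypothetical tuple of (average central activity value, skeleton text, omitted elements), omitted elements are appended to the end of the running summary instead of being restored to their original positions, the omitted elements of all sentences used so far are pooled in one list, and the choice between "next sentence" and "omitted element" follows the statement that the higher of the two is added. The flowchart's branch labels are not modeled.

```python
from typing import List, Tuple

Elem = Tuple[str, float]                  # (text, central activity value)
Sentence = Tuple[float, str, List[Elem]]  # (average value, skeleton, omitted)


def build_summary(sentences: List[Sentence], w_s: int) -> str:
    """Return a summary of at most w_s characters."""
    ranked = sorted(sentences, key=lambda s: s[0], reverse=True)
    summary = ""            # s_0, the empty initial summary
    pool: List[Elem] = []   # omitted elements, highest value first
    i = 0
    while i < len(ranked):
        _, skeleton, omitted = ranked[i]
        # Add the skeleton of the i-th ranked sentence.
        s_i = (summary + " " + skeleton).strip()
        pool = sorted(pool + omitted, key=lambda e: e[1], reverse=True)
        if len(s_i) > w_s:
            return summary  # keep the previous summary
        current = s_i
        while pool:
            has_next = i + 1 < len(ranked)
            next_avg = ranked[i + 1][0] if has_next else float("-inf")
            if next_avg > pool[0][1]:
                break       # the next sentence outranks every omitted element
            text, _ = pool.pop(0)
            ss_i = current + " " + text
            if len(ss_i) > w_s:
                return current  # the element does not fit; stop here
            current = ss_i
        summary = current
        i += 1
    return summary
```

With a small `w_s` only skeletons of the top-ranked sentences survive, and as `w_s` grows the omitted elements are restored in order of their values, which mirrors how the summary is described as adapting to the size of the summary display unit 54.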
[0146]
The window 51 includes a keyword input unit 55 for inputting a keyword and a button unit 56 having a plurality of buttons. When a keyword is input into the keyword input unit 55, the actual interest level is increased for those words displayed on the document display unit 53 that have a high degree of relevance between meanings, described later, with the keyword. The button unit 56 includes an “Undo” button for reverting the result of an operation and a “summarize” button for executing a process of summarizing the text displayed on the document display unit 53 and displaying it on the summary display unit 54. When the “summarize” button is selected, for example after the size of the summary display unit 54 has been changed, a summary of the document displayed on the document display unit 53 is generated to match the new size of the summary display unit 54, and the generated summary is displayed on the summary display unit 54.
[0147]
The actual interest level of the user in a document is calculated from a plurality of elements, as follows. Note that these elements of the actual interest level are distinct from the elements that constitute the document.
[0148]
In the calculation of the actual interest level, the first element A(D_i) is based on the position of the element that, among the elements designated by the user, appears farthest from the beginning of the document. The farther this position is from the beginning of the document, the more of the document the user is presumed to have read, and hence the greater the user's actual interest in the document is assumed to be. Specifically, the ratio of the maximum appearance position of the selected elements to the size of the entire document is used as the first element A(D_i), where D_i denotes the i-th document.
[0149]
In the document display section 53 of the window 51 shown in FIG. 23, the first element 57, the second element 58, and the third element 59 are designated by the user and highlighted. Of these, the third element 59 farthest from the beginning of the document is used for the calculation of the actual interest level.
[0150]
Further, in the calculation of the actual interest level, the second element E(D_i) consists of the number of document elements selected by the user in the document display unit 53 of the window 51 and the number of keywords input by the user into the keyword input unit 55.
[0151]
In the document display section 53 of the window 51 shown in FIG. 23, the user has designated the first element 57, the second element 58, and the third element 59, and has input the keyword “AAA” into the keyword input unit 55. The number of these designated elements and input keywords is the second element E(D_i).
[0152]
Further, in the calculation of the actual interest level, the ratio of the size of the area of the summary display unit 54 in the window 51 to the size of the area of the document display unit 53 is used as the third element W(D_i). This is because the summary is generated to match the size of the area of the summary display unit 54, and the higher the user's actual interest level, the more likely the user is to want a detailed, that is, a long summary. Therefore, the larger the ratio of the size of the area of the summary display unit 54 to the size of the area of the document display unit 53, the higher the actual interest level is taken to be.
[0153]
In the window 51 shown in FIG. 24, the ratio of the maximum size of the summary display unit 54, which displays the summary, to the size of the document display unit 53, which displays the entire document, is used as the third element W(D_i).
[0154]
Based on the first element A(D_i), the second element E(D_i), and the third element W(D_i) of the actual interest level, the user's actual interest level IR(D_i) for the document D_i is defined as
IR(D_i) = l_2 W(D_i) + m_2 A(D_i) + n_2 E(D_i)
where the coefficients l_2, m_2, and n_2 are constants representing the contribution of each value to the actual interest level. The values of these coefficients are l_2 = m_2 = 10 and n_2 = 1. The values of the coefficients l_2, m_2, and n_2 can also be estimated using statistical techniques. That is, given the actual interest level IR(D_i) for a plurality of sets of coefficients l_2, m_2, and n_2, the control unit 11 can obtain the coefficients by optimization.
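As an illustration of the definition above, the following sketch computes the actual interest level from the three elements. The function name and its parameters are hypothetical, and treating the second element E(D_i) as the simple sum of the number of selected elements and the number of keywords is an assumption. The default coefficients are the values given in the text.

```python
def actual_interest(
    last_selected_pos: int,   # position of the selected element farthest from the start
    doc_size: int,            # size of the whole document, in the same unit
    n_selected: int,          # number of elements the user selected
    n_keywords: int,          # number of keywords the user entered
    summary_area: float,      # size of the summary display area
    document_area: float,     # size of the document display area
    l2: float = 10.0,
    m2: float = 10.0,
    n2: float = 1.0,
) -> float:
    """IR(D_i) = l_2 * W(D_i) + m_2 * A(D_i) + n_2 * E(D_i)."""
    a = last_selected_pos / doc_size     # first element A(D_i)
    e = n_selected + n_keywords          # second element E(D_i), assumed to be a sum
    w = summary_area / document_area     # third element W(D_i)
    return l2 * w + m2 * a + n2 * e
```

For instance, three selected elements and one keyword, a last selection at position 800 of 1000, and a summary area one quarter the size of the document area give 10 × 0.25 + 10 × 0.8 + 1 × 4 = 14.5.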
[0155]
Next, the rearrangement of documents based on the predicted interest level, which is obtained using the actual interest level, will be described with reference to FIG. 25. This rearrangement is performed while the browser of FIG. 6 is open.
[0156]
In step S111, the control unit 11 of the document processing apparatus sets the count value C of a counter that counts the categories into which documents are classified to 0. In step S112, the control unit 11 calculates the inter-document relevance. That is, for each unread document in the category indicated by the count value C, among the documents classified in step S23, the control unit 11 calculates the inter-document relevance with each document that has been given an actual interest level. As described above, the actual interest level is given through the user's operations. The inter-document relevance is calculated based on the above-described index. Details of this calculation will be described later.
[0157]
In step S113, the control unit 11 of the document processing apparatus calculates a predicted interest level. The predicted interest level is calculated based on the inter-document relevance level between the document and the document that has already been given the actual interest level. Therefore, the predicted interest level is calculated for a document for which no actual interest level is given.
[0158]
For one unread document in the category, the control unit 11 selects the other document in the category for which the inter-document relevance calculated in step S112 is highest. The control unit 11 takes the actual interest level of the selected document as the predicted interest level of the unread document. The control unit 11 stores the predicted interest level thus obtained in, for example, the RAM 14.
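The selection described above amounts to a nearest-neighbour lookup, which the following sketch illustrates. The function name, the dictionary of rated documents, and the `relevance` callable are hypothetical stand-ins; returning `None` when no document in the category has an actual interest level yet is also an assumption, since the text does not cover that case.

```python
from typing import Callable, Dict, Optional


def predicted_interest(
    unread: str,
    rated: Dict[str, float],                 # document id -> actual interest level
    relevance: Callable[[str, str], float],  # inter-document relevance
) -> Optional[float]:
    """Copy the actual interest level of the most relevant rated document."""
    if not rated:
        return None  # assumed behaviour when nothing has been rated yet
    best = max(rated, key=lambda other: relevance(unread, other))
    return rated[best]
```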
[0159]
In step S118, the control unit 11 of the document processing apparatus branches the process depending on whether the calculation of the predicted interest level has been completed for all the documents in the category. When the calculation has been completed for all the documents in the category, the control unit 11 determines “YES” and proceeds to step S114; otherwise it determines “NO” and returns to step S112.
[0160]
In step S114, the control unit 11 of the document processing apparatus sorts the unread documents in each category based on the predicted interest level calculated in step S113. As the rearrangement method, unread documents with a higher predicted interest level are given a higher priority, and unread documents with a higher priority are placed nearer the head of the list of unread document titles. When there is no significant difference in priority, the document with the more recent reception date and time is placed first. The document titles are arranged in this order for each category, for example in the classification display sections 303, 304, and 305 of the document classification window 301.
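A compact way to express this ordering is a sort on a compound key, sketched below. The dictionary keys `title`, `predicted`, and `received` are illustrative names only. The sketch also treats "no significant difference" as exact equality of the predicted interest levels, which is a simplification; the text does not say how large a difference counts as significant.

```python
from datetime import datetime
from typing import Any, Dict, List


def sort_unread(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Highest predicted interest first; on ties, most recently received first."""
    return sorted(docs, key=lambda d: (d["predicted"], d["received"]), reverse=True)
```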
[0161]
In step S115, the control unit 11 of the document processing apparatus determines whether all categories have been processed. When all categories have been processed, the control unit 11 determines “YES” and advances the process to step S117. When not all categories have been processed, the control unit 11 determines “NO” and advances the process to step S116.
[0162]
In step S116, the control unit 11 of the document processing apparatus increases the count value C of the counter that counts the categories by 1. That is, the control unit 11 sets C = C + 1. The control unit 11 then returns the process to step S112. In step S117, because it was determined in step S115 that all categories have been processed, the control unit 11 displays the rearranged documents. Specifically, as shown in FIG. 6, a document icon and a document title are displayed for each document. If a document has no title, a one-sentence summary is displayed instead. This series of steps is then completed.
[0163]
Next, the calculation of the inter-document relevance in step S112 in FIG. 25 will be described in detail with reference to FIG. 26. The inter-document relevance is the degree of relevance between one document D_i and another document D_j.
[0164]
In step S41, the control unit 11 of the document processing apparatus calculates the number P(D_i, D_j) of proper nouns common to the set of proper nouns included in the index of one document D_i and the set of proper nouns included in the index of another document D_j that has already been classified into the category specified in step S111 or S116 of FIG. 25. The control unit 11 then stores the calculated number P(D_i, D_j) in, for example, the RAM 14.
[0165]
In step S42, the control unit 11 of the document processing apparatus refers to the table of relevance between meanings shown in FIG. 15 and calculates the sum R(D_i, D_j) of the relevance between the meanings included in the index of one document D_i and the meanings included in the index of another document D_j.
[0166]
More specifically, in step S42 the control unit 11 of the document processing apparatus takes the words other than proper nouns in one unread document D_i, refers to the table of relevance between meanings, and calculates the sum R(D_i, D_j) of their relevance to the words of another document D_j. The control unit 11 then stores the calculated sum R(D_i, D_j) in, for example, the RAM 14.
[0167]
In step S43, the control unit 11 of the document processing apparatus calculates the inter-document relevance of another document D_j to one document D_i, which is defined as
Rel(D_i, D_j) = m_3 P(D_i, D_j) + n_3 R(D_i, D_j)
where the coefficients m_3 and n_3 are constants representing the contribution of each value to the inter-document relevance. The control unit 11 reads the number of common proper nouns P(D_i, D_j) calculated in step S41 and the sum R(D_i, D_j) calculated in step S42 from, for example, the RAM 14, and applies them to the above formula to calculate the inter-document relevance Rel(D_i, D_j). The coefficients take, for example, the values m_3 = 10 and n_3 = 1.
[0168]
The values of the coefficients m_3 and n_3 can also be estimated using statistical techniques. That is, given the inter-document relevance Rel(D_i, D_j) for a plurality of sets of coefficients m_3 and n_3, the control unit 11 can obtain the coefficients by optimization.
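Steps S41 to S43 can be illustrated with the short sketch below. The index layout (a dictionary with `proper_nouns` and `senses` entries), the representation of the table of relevance between meanings as a dictionary keyed by pairs, and the treatment of a missing pair as relevance 0 are all assumptions made for the example. The default coefficients are the example values given in the text.

```python
from typing import Dict, List, Tuple

Index = Dict[str, List[str]]


def inter_document_relevance(
    index_i: Index,
    index_j: Index,
    sense_table: Dict[Tuple[str, str], float],
    m3: float = 10.0,
    n3: float = 1.0,
) -> float:
    """Rel(D_i, D_j) = m_3 * P(D_i, D_j) + n_3 * R(D_i, D_j)."""
    # P: number of proper nouns the two indexes have in common (step S41).
    p = len(set(index_i["proper_nouns"]) & set(index_j["proper_nouns"]))
    # R: sum of relevance between the meanings in the two indexes (step S42).
    r = 0.0
    for a in index_i["senses"]:
        for b in index_j["senses"]:
            r += sense_table.get((a, b), sense_table.get((b, a), 0.0))
    # Weighted combination (step S43).
    return m3 * p + n3 * r
```

Because m_3 is ten times n_3 in the example values, one shared proper noun contributes as much as a total meaning relevance of 10.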
[0169]
Next, the recording medium 32 recorded / reproduced by the recording / reproducing unit 31 of the document processing apparatus will be described. A document processing program for processing a document having an internal structure by tagging from a plurality of elements is recorded on the recording medium. As the recording medium 32, for example, a floppy disk capable of recording / reproducing information is used.
[0170]
The document processing program recorded on the recording medium 32 includes an actual interest level detection process for detecting the actual interest level for a document, and a priority order setting process for setting a priority order for the document based on the actual interest level detected by the actual interest level detection process. The program further includes a display process for displaying the document and an input process for receiving manual input for the document displayed in the display process. The actual interest level detection process detects the actual interest level based on the input received in the input process.
[0171]
In the present embodiment, an example of a tagging method for a document has been shown, but it is needless to say that the present invention is not limited to this tagging method. In the present embodiment, the document is transmitted from the outside to the receiving unit 21 of the document processing apparatus, but the present invention is not limited to this. For example, the document may be written in the ROM 13 of the document processing apparatus or read from the recording medium 32 by the recording / reproducing unit 31.
[0172]
In the above-described embodiment, a mouse is given as an example of a device for selecting a desired element from the document displayed on the display unit 30 of the document processing apparatus, but it goes without saying that the present invention is not limited to this. Other devices, such as a tablet or a light pen, can also be used to input elements in the document processing apparatus.
[0173]
Furthermore, in the above-described embodiment, Japanese and English sentences are exemplified, but it goes without saying that the present invention is not limited to these languages.
[0174]
[Effects of the invention]
As described above, the present invention processes an electronic document, detects an actual interest level for the electronic document, and sets a priority order for the electronic document based on the detected actual interest level. The present invention also displays an electronic document, accepts manual input for the displayed electronic document, and detects the actual interest level based on this input. Therefore, according to the present invention, the priority order of electronic documents is set so as to reflect the user's actual interest level, which improves convenience for the user.
[0175]
Furthermore, according to the present invention, the priority level is set based on the predicted interest level, with the actual interest level of the document having the highest relevance among the electronic documents for which the actual interest level has already been obtained as the predicted interest level. Therefore, according to the present embodiment, a priority order can be given to a document to which no actual interest level is given.
[0176]
In the present invention, the electronic document is classified into a plurality of classification items, and the priority order is set for the electronic document for each classification item. Therefore, the present invention provides convenience to the user by setting a priority for each classification item.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a document processing apparatus to which an exemplary embodiment is applied.
FIG. 2 is a diagram showing an internal structure by tagging a document.
FIG. 3 is a diagram showing a window displaying an internal structure by tagging a document.
FIG. 4 is a flowchart showing an operation of a document processing apparatus to which the embodiment is applied.
FIG. 5 is a diagram illustrating a GUI for performing document classification, before documents are classified.
FIG. 6 is a diagram showing a GUI for classifying documents.
FIG. 7 shows a table of classification models.
FIG. 8 is a flowchart for automatically classifying documents.
FIG. 9 is a flowchart for creating an index by finding document features.
FIG. 10 is a flowchart showing active diffusion.
FIG. 11 is a diagram for explaining active diffusion processing.
FIG. 12 is a flowchart of link processing for active diffusion.
FIG. 13 is a flowchart for calculating the degree of association between document classifications.
FIG. 14 is a flowchart of calculation of the degree of association between meanings.
FIG. 15 is a diagram showing a table of relevance levels between meanings.
FIG. 16 is a flowchart for browsing and sorting documents.
FIG. 17 is a flowchart showing a series of steps for increasing the importance of an arbitrary part of a sentence.
FIG. 18 shows a summary window.
FIG. 19 is a diagram showing a state in which a word is selected in the summary window.
FIG. 20 is a diagram illustrating a state where a region selected in the summary window is further clicked.
FIG. 21 is a diagram showing a state in which a summary is displayed in a summary window.
FIG. 22 is a diagram showing details of the summary creation process.
FIG. 23 is a diagram for explaining the calculation of the actual interest level from the maximum appearance position of the selected element.
FIG. 24 is a diagram for explaining the calculation of the actual interest level from the ratio of the maximum size of the summary element to the entire document.
FIG. 25 is a flowchart for automatically classifying a document based on a predicted interest level.
FIG. 26 is a flowchart for calculating the degree of association between documents.
[Explanation of symbols]
10 main body, 11 control unit, 12 interface, 13 CPU, 20 input unit, 21 receiving unit, 30 display unit, 31 recording/reproducing unit

Claims

A document processing method of a document processing apparatus that processes a plurality of electronic documents, the method comprising:
a receiving step in which receiving means receives a plurality of electronic documents;
a recording step in which recording means records the plurality of electronic documents received in the receiving step;
a display step in which display means displays an electronic document selected by the user from among the plurality of electronic documents recorded in the recording step, together with a summary of that electronic document;
an input step in which input means inputs the user's operation information for the electronic document displayed in the display step;
an actual interest level detection step in which actual interest level detection means calculates the user's actual interest level in the electronic document based on the user's operation information input in the input step for the electronic document displayed in the display step;
a priority setting step in which priority setting means, for an electronic document whose actual interest level has not been calculated, takes as a predicted interest level the actual interest level of the electronic document that has the highest relevance, based on the internal structure of the electronic documents, among the electronic documents whose actual interest level was calculated in the actual interest level detection step, and sets a priority based on the predicted interest level; and
a sorting step in which sorting means sorts, among the plurality of electronic documents recorded in the recording step, the electronic documents not selected by the user according to the priority set in the priority setting step,
wherein, in the actual interest level detection step, the actual interest level is calculated, for each of the display areas that respectively display the electronic document and the summary of the electronic document, using: a first actual interest level element consisting of the ratio between the size of the electronic document and the appearance position of the element that, among the elements of the electronic document selected by the user, appears farthest from the head of the electronic document; a second actual interest level element consisting of the number of keywords specified by the user and the number of elements selected by the user; and a third actual interest level element consisting of the ratio between the size of the display area of the summary of the electronic document displayed in the display step and the size of the display area of the electronic document.

The document processing method according to claim 1, further comprising:
an index creation step in which index creating means extracts features based on the internal structure of the plurality of electronic documents received in the receiving step and creates an index of the extracted features;
an index recording step in which index recording means records the index created in the index creation step; and
an index update step in which index update means performs update processing that records in the index the user's operation history, based on the user's operation information input in the input step and/or the actual interest level calculated in the actual interest level detection step.

The document processing method according to claim 1, wherein the internal structure of the electronic document is described by attribute information.

The document processing method according to claim 1, further comprising a classification step in which classification means classifies the electronic documents into a plurality of classification items,
wherein, in the priority setting step, the priority setting means sets the priority of the electronic documents for each classification item into which they were classified by the classification means in the classification step.

A document processing apparatus that processes a plurality of electronic documents, the apparatus comprising:
receiving means for receiving a plurality of electronic documents;
recording means for recording the plurality of electronic documents received by the receiving means;
display means for displaying an electronic document selected by the user from among the plurality of electronic documents recorded by the recording means, together with a summary of that electronic document;
input means for inputting the user's operation information for the electronic document displayed by the display means;
actual interest level detection means for calculating the user's actual interest level in the electronic document based on the user's operation information input by the input means for the electronic document displayed by the display means;
priority setting means for, with respect to an electronic document whose actual interest level has not been calculated, taking as a predicted interest level the actual interest level of the electronic document that has the highest relevance, based on the internal structure of the electronic documents, among the electronic documents whose actual interest level was calculated by the actual interest level detection means, and setting a priority based on the predicted interest level; and
sorting means for sorting, among the plurality of electronic documents recorded by the recording means, the electronic documents not selected by the user according to the priority set by the priority setting means,
wherein the actual interest level detection means calculates the actual interest level, for each of the display areas that respectively display the electronic document and the summary of the electronic document, using: a first actual interest level element consisting of the ratio between the size of the electronic document and the appearance position of the element that, among the elements of the electronic document selected by the user, appears farthest from the head of the electronic document; a second actual interest level element consisting of the number of keywords specified by the user and the number of elements selected by the user; and a third actual interest level element consisting of the ratio between the size of the display area of the summary of the electronic document displayed by the display means and the size of the display area of the electronic document.

A computer-readable recording medium on which is recorded a document processing program for causing a computer to execute document processing that processes a plurality of electronic documents,
the document processing program causing the computer to execute:
a receiving step of receiving a plurality of electronic documents;
a recording step of recording the plurality of electronic documents received in the receiving step;
a display step of displaying an electronic document selected by the user from among the plurality of electronic documents recorded in the recording step, together with a summary of that electronic document;
an input step in which input means inputs the user's operation information for the electronic document displayed in the display step;
an actual interest level detection step in which actual interest level detection means calculates the user's actual interest level in the electronic document based on the user's operation information input in the input step for the electronic document displayed in the display step;
a priority setting step in which priority setting means, for an electronic document whose actual interest level has not been calculated, takes as a predicted interest level the actual interest level of the electronic document that has the highest relevance, based on the internal structure of the electronic documents, among the electronic documents whose actual interest level was calculated in the actual interest level detection step, and sets a priority based on the predicted interest level; and
a sorting step in which sorting means sorts, among the plurality of electronic documents recorded in the recording step, the electronic documents not selected by the user according to the priority set in the priority setting step,
wherein, in the actual interest level detection step, the actual interest level is calculated, for each of the display areas that respectively display the electronic document and the summary of the electronic document, using: a first actual interest level element consisting of the ratio between the size of the electronic document and the appearance position of the element that, among the elements of the electronic document selected by the user, appears farthest from the head of the electronic document; a second actual interest level element consisting of the number of keywords specified by the user and the number of elements selected by the user; and a third actual interest level element consisting of the ratio between the size of the display area of the summary of the electronic document displayed in the display step and the size of the display area of the electronic document.