JP3845046B2

JP3845046B2 - Document management method and document management apparatus

Info

Publication number: JP3845046B2
Application number: JP2002237303A
Authority: JP
Inventors: 直也植松; 一郎廣田
Original assignee: 株式会社ジャストシステム
Priority date: 2002-08-16
Filing date: 2002-08-16
Publication date: 2006-11-15
Anticipated expiration: 2022-08-16
Also published as: JP2004078512A

Description

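[0049]
[Effects of the Invention]
According to the present invention, user convenience in making use of accumulated information can be improved.
[Brief Description of the Drawings]
FIG. 1 is a diagram illustrating the configuration of a document management system according to the embodiment.
FIG. 2 is a functional block diagram showing the basic configuration of the document management server according to the embodiment.
FIG. 3 is a diagram illustrating the context between documents that is chained together by links between phrases contained in each document.
FIG. 4 is a diagram illustrating a state in which a plurality of documents arranged in time series are associated by a plurality of contexts.
FIG. 5 is a diagram illustrating an example of a screen that displays the thread for each phrase in a tree shape.
FIG. 6 is a flowchart showing the processing executed by the document management server according to the embodiment.
[Explanation of Reference Numerals]
10 document management server, 14 storage unit, 16 condition setting unit, 20 phrase determination unit, 24 extraction unit, 28 thread management unit, 36 output unit.
[0001]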
BACKGROUND OF THE INVENTION
The present invention relates to a document management method and a document management apparatus. In particular, the present invention relates to a technique for associating a plurality of documents based on their contents.
[0002]
[Prior art]
In recent years, attention has focused on technologies for centrally managing and making effective use of large numbers of files, such as in-house documents created by different people. Accumulated documents are a concentration of knowledge and know-how, but the shortest route to sharing and exploiting these latent assets as truly meaningful information is to manage the documents systematically from the outset. Known approaches include treating each document as a single record from the start, on the premise that a database will be built, and embedding a search index in each document.
[0003]
Separately, there is an approach in which related documents are linked to one another in advance so that a group of documents can be extracted efficiently. The linking is achieved by adding, to an area such as the document header, reference information for other documents related to that document. For example, when e-mail software creates a reply to a particular received message, it writes the reference ID of the received message in the header of the reply to preserve the relationship between them. Electronic bulletin boards manage reference relationships in a similar way.
[0004]
[Problems to be solved by the invention]
Documents managed by the above methods are created on the premise of a specific management method, so processing such as adding an index or reference information is applied when they are created. Documents that have not undergone such processing, however, fall outside the scope of management by those methods. Bringing them under management requires manually adding indexes or identifying reference relationships, which takes a great deal of time and labor. Moreover, even when documents have been related to one another in a planned manner, the viewpoint of the given relationships is uniform, so the relationships cannot always be put to effective secondary use.
[0005]
The present inventor has made the present invention based on the above recognition, and an object thereof is to provide a technique for efficiently managing a large amount of document files. Another object of the present invention is to provide a technique for detecting a document type as a search key for a document file. Yet another object is to provide a technology that supports the construction of a document database. Yet another object is to provide a technology that supports analysis of trends in a plurality of documents. Yet another object is to provide a technique for visualizing the relationship between documents.
[0006]
[Means for Solving the Problems]
One aspect of the present invention relates to a document management method. This method includes a step of determining a word / phrase to serve as a viewpoint for finding relevance between documents, a step of extracting a plurality of documents whose content includes the determined word / phrase, a step of recognizing the existence of a context between documents by associating the extracted documents with one another, and a step of presenting the documents for which the existence of a context has been recognized.
[0007]
The “document” corresponds to a file generated by application software such as a word processor, presentation software, an e-mail client, and a scheduler. It may be text data generated by a server program such as CGI (Common Gateway Interface) of an electronic bulletin board. The “document” is assumed to be a data file mainly including text in order to be subjected to language analysis processing. However, the data format is not necessarily limited, and may be an image file, a CAD data file, or the like.
[0008]
A “word / phrase” includes words, phrases, clauses, and the like, and may be of any length. It may be a phrase made up of a plurality of words in a dependency relationship. The “word / phrase” need not actually appear in the documents and may instead be a word / phrase representing a topic or concept common to the documents. The “context between documents” is a content-based relationship existing between documents, and may be the thread of document content that emerges when a plurality of documents are arranged in order of creation date and time or update date and time. This context is detected automatically from the document content, and its features are used as a search key for document retrieval.
[0009]
The plurality of documents may be associated by setting links between them so as to form a thread. A “thread” is a format in which a plurality of pieces of data, distinguished by topic, are each associated with the data that precede and follow them in time series; in this aspect, a thread is formed for each word / phrase serving as a viewpoint. The thread format is commonly used in electronic bulletin boards and e-mail clients, and the context between a series of documents in this aspect can be displayed in thread format. “Time-series order” may be determined not only from the date and time at which a document was created or updated, but also from the date and time at which the document was referenced or the date and time at which it was approved through an internal approval (ringi) process.
[0010]
According to this aspect, a plurality of documents that carry no added information about their mutual relationships are associated automatically. Because the detected relationships are based purely on document content, the chain of relationships indicates the context between the documents. Since this context can be expressed in a familiar format such as a thread, the documents as a whole can be managed efficiently while their relationships are kept in view.
[0011]
Another aspect of the present invention relates to a document management apparatus. The apparatus includes a storage unit that stores documents; a condition setting unit that determines a word / phrase to serve as a viewpoint for finding relevance between documents; an extraction unit that extracts, from the storage unit, a plurality of documents whose content includes the determined word / phrase; a thread management unit that, among the extracted documents, associates by links the combinations of documents that are adjacent under an order according to a predetermined rule, and recognizes the existence of a context between documents by chaining a plurality of the links; and an output unit that presents the documents for which the existence of a context has been recognized.
[0012]
The storage unit mainly indicates a storage device such as a hard disk. The storage unit may be configured separately from the main body, and in that case, may be connected to the main body via a network. Various forms of storage positions and storage states of a plurality of documents can be assumed. The condition setting unit, the extraction unit, the thread management unit, and the output unit are realized by a central processing unit, a control device, and the like.
[0013]
The condition setting unit may determine the word / phrase serving as a viewpoint by automatically extracting characteristic words / phrases contained in the document content, or may determine it based on a designation by the user. The “order according to a predetermined rule” may be, for example, ascending order of serial numbers assigned to the documents; any objective order derived from some attribute of each document will suffice. As the “link”, a technique such as XLink in XML (eXtensible Markup Language) may be used, for example, to connect identical phrases across documents.
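As a rough illustration of linking a phrase in one document to the same phrase in another with XLink (the W3C XML Linking Language), the following sketch builds a simple XLink element in Python. The element name `phrase`, the helper `make_phrase_link`, and the target URI `doc52.xml#company-a` are assumptions made for this example; the specification does not prescribe any particular markup.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the W3C XLink recommendation.
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def make_phrase_link(phrase, target_uri):
    """Wrap a phrase in a (hypothetical) <phrase> element carrying a simple XLink."""
    element = ET.Element("phrase")
    element.set(f"{{{XLINK_NS}}}type", "simple")
    element.set(f"{{{XLINK_NS}}}href", target_uri)
    element.text = phrase
    return element


# Link "Company A" in one document to the same phrase in the next document.
link = make_phrase_link("Company A", "doc52.xml#company-a")
xml_text = ET.tostring(link, encoding="unicode")
print(xml_text)
```

A simple link of this kind points in one direction only; mutual linking between two documents would need one such element in each document.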
[0014]
When this apparatus is implemented as a file server, a new document is automatically associated with other documents simply by being transferred to the server. Likewise, when past documents that have already accumulated in large quantities are input to the server, they are associated automatically. As a result, files that are created or edited in large numbers by multiple people, such as in-house documents, can be managed efficiently without burdening their creators or editors.
[0015]
Yet another aspect of the present invention also relates to a document management apparatus. The apparatus includes a storage unit that stores documents; a condition setting unit that determines a word / phrase to serve as a viewpoint for finding relevance between documents; an extraction unit that extracts, from the storage unit, a plurality of documents whose content includes the determined word / phrase; a thread management unit that, among the extracted documents, associates by links the combinations of documents that are adjacent in chronological order, and recognizes the existence of a context between documents by chaining a plurality of the links; and an output unit that presents the documents for which the existence of a context has been recognized.
[0016]
Even in this mode, a large amount of files can be managed efficiently, and the causal relationship between documents can be presented in a form that can be grasped more intuitively by associating them in chronological order.
[0017]
This apparatus covers both the case in which it refers to either a server or a user terminal installed on a network and the case in which it refers to a system consisting of a server and user terminals connected via a network. The functional blocks, namely the storage unit, condition setting unit, extraction unit, thread management unit, and output unit, may be provided on the server side or on the user terminal side. They may also be provided on both the server and the user terminal, in which case the corresponding functional blocks may have the same names. Each of these functions may be provided in the form of a program module and executed on either or both of the server and the user terminal, or may be downloaded from the server to the user terminal for execution.
[0018]
Note that any combination of the above constituent elements, and any conversion of the constituent elements or expressions of the present invention between methods, apparatuses, systems, computer programs, recording media storing computer programs, data structures, and the like, are also valid as aspects of the present invention.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
In the present embodiment, a plurality of documents are associated with one another based on the phrases they have in common, and the context between documents is detected from the distribution of the resulting relationships and the way they chain together. Each document is then classified according to the mode of its context, and the classification is treated as one of the search conditions. In this way, relationships from various viewpoints can be found among documents that were not associated in advance, and information that can stand in for a search index is generated from the tendencies of those relationships. Moreover, any document from which phrases can be extracted can be managed, regardless of its data format.
[0020]
FIG. 1 is a diagram illustrating a configuration of a document management system according to an embodiment. The document management system 100 includes a plurality of user terminals 104 and a document management server 10 connected via a network 102. The document management server 10 may be configured to include a web server, and the user terminal 104 may be configured to include a personal computer and client software installed therein.
[0021]
In hardware terms, the document management server 10 can be realized by elements such as a computer's CPU; in software terms, it is realized by programs having language analysis and data management functions. FIG. 2, described below, depicts the functional blocks realized by their cooperation. These functional blocks can therefore be realized in various forms by combinations of hardware and software.
[0022]
FIG. 2 is a functional block diagram showing the basic configuration of the document management apparatus according to the embodiment. The document management server 10 includes an input unit 12, a storage unit 14, a condition setting unit 16, a document management unit 22, an output unit 36, and a communication unit 42. Each unit transmits and receives data to and from a terminal on the network via the communication unit 42. The communication unit 42 may have a router function and a server function in addition to the data transmission / reception function.
[0023]
The storage unit 14 stores a plurality of documents. A document is stored in the storage unit 14 via the input unit 12. The input unit 12 is an interface for data input, and may process data input via another data port in addition to input via the communication unit 42. The input unit 12 may provide a user with a workspace such as a portal site or an electronic bulletin board for inputting and outputting documents. When the input unit 12 captures an external document, the user may specify a domain on the network as a capture range.
[0024]
The condition setting unit 16 has a function of determining a phrase that serves as a viewpoint for finding relevance between documents (hereinafter referred to as a “viewpoint phrase”), and includes a document analysis unit 18 and a phrase determination unit 20. The document analysis unit 18 analyzes the contents of the documents stored in the storage unit 14 and extracts characteristic phrases. For example, it may extract phrases that appear frequently across the documents, or it may focus on noun phrases that include proper nouns. Phrases may be extracted together with modifiers such as adjectives. Phrases may be weighted according to whether they appear in a document's title, or they may be extracted after the document has been summarized. Priorities for extracted phrases may also be set in advance.
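One way to picture the extraction of characteristic phrases with title weighting is the sketch below. It assumes the candidate phrases are already known (real language analysis is out of scope here), and the function name `pick_viewpoint_phrases`, the weight of 2.0 for a title hit, and the sample documents are all illustrative assumptions rather than anything stated in the specification.

```python
from collections import Counter


def pick_viewpoint_phrases(docs, candidates, title_weight=2.0, top_n=2):
    """Score each candidate by the documents that contain it.

    A hit in the title counts as `title_weight`; a hit only in the body counts as 1.
    """
    scores = Counter()
    for doc in docs:
        for phrase in candidates:
            if phrase in doc["title"]:
                scores[phrase] += title_weight
            elif phrase in doc["body"]:
                scores[phrase] += 1.0
    return [phrase for phrase, _ in scores.most_common(top_n)]


docs = [
    {"title": "Visit to Company A", "body": "Discussed Product B pricing."},
    {"title": "Product B roadmap", "body": "Company A asked for a demo."},
    {"title": "Office move", "body": "New desks arrive on Monday."},
]
phrases = pick_viewpoint_phrases(docs, ["Company A", "Product B", "desks"])
print(phrases)
```

Here "Company A" and "Product B" each score 3.0 (one title hit plus one body hit), while "desks" scores 1.0 and is left out.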
[0025]
The document analysis unit 18 may determine the range of documents to be analyzed according to the breadth of the workspace or the division into departments, or may analyze the documents within a domain designated by the user.
[0026]
The phrase determination unit 20 determines the viewpoint phrase based on the extraction result from the document analysis unit 18 or the instruction from the user. The user instruction is received via the network 102 via the communication unit 42.
[0027]
The document management unit 22 includes an extraction unit 24, a time series processing unit 26, and a thread management unit 28. The time series processing unit 26 rearranges the plurality of documents stored in the storage unit 14 into chronological order. The rearrangement may be based on time attributes of each document, such as its creation date and time or update date and time, or on a separately managed history of how the documents came to be generated. The history may be recorded through the workspace provided by the input unit 12, and may be stored in the storage unit 14.
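The chronological rearrangement can be sketched as a sort over a per-document timestamp. In the sketch below, the dictionary keys `created` and `updated`, the optional `history` mapping from document id to a separately recorded timestamp, and the rule that the history takes precedence are assumptions chosen for illustration.

```python
from datetime import datetime


def sort_chronologically(docs, history=None):
    """Return documents ordered by time.

    Uses the separately managed history entry when one exists; otherwise the
    update time if present, falling back to the creation time.
    """
    history = history or {}

    def timestamp(doc):
        if doc["id"] in history:
            return history[doc["id"]]
        return doc.get("updated") or doc["created"]

    return sorted(docs, key=timestamp)


docs = [
    {"id": "memo", "created": datetime(2002, 5, 1)},
    {"id": "plan", "created": datetime(2002, 3, 1), "updated": datetime(2002, 6, 1)},
    {"id": "report", "created": datetime(2002, 4, 1)},
]
by_attributes = [d["id"] for d in sort_chronologically(docs)]
by_history = [d["id"] for d in sort_chronologically(docs, {"plan": datetime(2002, 1, 1)})]
print(by_attributes, by_history)
```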
[0028]
The extraction unit 24 extracts, from the storage unit 14, a plurality of documents whose content includes the phrase determined by the condition setting unit 16. In doing so, the extraction unit 24 pairs, as link targets, those documents containing the same phrase that are adjacent to each other when arranged in chronological order.
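The extraction and pairing step can be sketched as a filter followed by pairing each hit with the next one. The function name `extract_and_pair`, the plain substring test, and the sample texts are assumptions for illustration; the document numbers follow FIG. 3.

```python
def extract_and_pair(docs, phrase):
    """Select documents containing `phrase` and pair chronological neighbours.

    `docs` must already be in chronological order.
    """
    hits = [doc["id"] for doc in docs if phrase in doc["text"]]
    pairs = list(zip(hits, hits[1:]))  # (earlier, later) link targets
    return hits, pairs


docs = [
    {"id": 50, "text": "Company A asked about Product B."},
    {"id": 52, "text": "Quote for Company A covering Product B."},
    {"id": 54, "text": "Meeting: Company A and the Product B roadmap."},
    {"id": 56, "text": "Contract draft for Company A."},
    {"id": 58, "text": "Product B test results."},
]
hits_a, pairs_a = extract_and_pair(docs, "Company A")
hits_b, pairs_b = extract_and_pair(docs, "Product B")
print(pairs_a)
print(pairs_b)
```

Because document 56 lacks "Product B", the pair for that phrase skips from 54 to 58, which is how the two contexts diverge after document 54.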
[0029]
The thread management unit 28 includes a link processing unit 30, a context recognition unit 32, and a classification processing unit 34. The link processing unit 30 associates the documents to be linked by writing, within the documents, links between their common viewpoint phrases using a linking method such as XLink. The unit of linking may be an element enclosed by tags in an XML document. The context recognition unit 32 recognizes the existence of a context between a series of documents by chaining a plurality of links into a thread. Information identifying the thread or context is sent to the classification processing unit 34.
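Chaining links into a thread can be pictured as following (earlier, later) pairs from the one document that has no incoming link. This is a minimal sketch assuming each phrase yields a single unbranched chain; the function name `chain_links` is hypothetical.

```python
def chain_links(pairs):
    """Recover the ordered thread from unordered (earlier, later) links."""
    if not pairs:
        return []
    successor = dict(pairs)
    targets = set(successor.values())
    # The head of the thread is the only document nothing links to.
    head = next(src for src in successor if src not in targets)
    thread = [head]
    while thread[-1] in successor:
        thread.append(successor[thread[-1]])
    return thread


# Links for one viewpoint phrase, deliberately given out of order.
links = [(54, 56), (50, 52), (56, 60), (52, 54), (60, 62)]
thread = chain_links(links)
print(thread)
```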
[0030]
The classification processing unit 34 classifies the plurality of documents associated by a series of contexts according to the mode of the context recognized from each document. Multiple contexts usually run through each document, and the mode of the context is classified into several types based on consistent tendencies in how the document is linked to the preceding and following documents. The type is saved as an attribute so that it can later be specified as one of the search conditions. The attribute may be written into each document or saved collectively in the storage unit 14 as management data. This makes it possible to narrow down searches and remove noise in ways not previously available. The context types are described later.
[0031]
The output unit 36 has a function of presenting a plurality of documents together with a context, and includes a search processing unit 38 and a display processing unit 40. The search processing unit 38 acquires a user designation regarding the context mode as one of the search conditions, extracts a document classified into the specified mode from the storage unit 14, and presents it as a search result. The display processing unit 40 has a function of sending data for displaying the recognized context on the screen in the form of a thread to the user terminal 104 and a function of sending a search result by the search processing unit 38 to the user terminal 104. The display processing unit 40 may provide a context correction function based on a user instruction through a screen for displaying the context.
[0032]
FIG. 3 is a diagram illustrating the context between documents that is chained together by links between phrases contained in each document. Multiple contexts run through this series of documents, and the documents are classified into several types according to the mode of the context recognized from each of them. The classification processing unit 34 finds, in each document, feature points relating to the mode of the context. Specifically, documents are classified according to whether the document is a start point, merge point, passage point, branch point, or end point of the context.
[0033]
In this figure, the viewpoint phrases “Company A” and “Product B” are extracted from each document and links are set. The documents are arranged in chronological order. The phrases “Company A” and “Product B” are not extracted from any document earlier than the first document 50, so the first document 50 is not linked to any earlier document. The first document 50 is therefore judged to be of a type with a strong tendency to be the start point of a context.
[0034]
The phrases “Company A” and “Product B” in the first document 50 are linked to the phrases “Company A” and “Product B” in the second document 52, respectively, which are in turn linked to the phrases “Company A” and “Product B” in the third document 54. The second document 52 is therefore judged to be of a type with a strong tendency to be a passage point of a context.
[0035]
The phrase “Company A” in the third document 54 is linked to the phrase “Company A” in the fourth document 56, whereas the phrase “Product B” in the third document 54 is linked to the phrase “Product B” in the fifth document 58. The third document 54 is therefore judged to be of a type with a strong tendency to be a branch point of a context. This type may be regarded as a document that was a factor in a decision, or the modes of branching may be subdivided further and one of them regarded as likely to be meeting minutes.
[0036]
The phrase “Company A” in the fourth document 56 and the phrase “Product B” in the fifth document 58 are linked to the phrases “Company A” and “Product B” in the sixth document 60, respectively. The sixth document 60 is therefore judged to be of a type with a strong tendency to be a merge point of contexts. This type may be regarded as a document that marks the convergence of, or a milestone in, a project.
[0037]
The phrases “Company A” and “Product B” in the sixth document 60 are linked to the phrases “Company A” and “Product B” in the seventh document 62, respectively, and are not linked to any later document. The seventh document 62 is therefore judged to be of a type with a strong tendency to be the end point of a context.
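The walk-through of FIG. 3 can be reproduced with a small classifier that looks at which documents each document is linked from and linked to. The function name `classify`, the label strings, and the priority applied when several conditions hold at once (branch is checked before merge) are assumptions for illustration; the specification speaks only of tendencies toward each type.

```python
from collections import defaultdict


def classify(threads):
    """Label each document by its position in the contexts.

    `threads` maps a viewpoint phrase to its chronological list of document ids.
    """
    preds, succs = defaultdict(set), defaultdict(set)
    docs = set()
    for chain in threads.values():
        docs.update(chain)
        for earlier, later in zip(chain, chain[1:]):
            succs[earlier].add(later)
            preds[later].add(earlier)

    labels = {}
    for doc in sorted(docs):
        if not preds[doc] and not succs[doc]:
            labels[doc] = "isolated"
        elif not preds[doc]:
            labels[doc] = "start"
        elif not succs[doc]:
            labels[doc] = "end"
        elif len(succs[doc]) > 1:
            labels[doc] = "branch"  # outgoing links lead to different documents
        elif len(preds[doc]) > 1:
            labels[doc] = "merge"  # incoming links come from different documents
        else:
            labels[doc] = "pass"
    return labels


threads = {
    "Company A": [50, 52, 54, 56, 60, 62],
    "Product B": [50, 52, 54, 58, 60, 62],
}
labels = classify(threads)
print(labels)
```

With the two contexts of FIG. 3, document 54 comes out as a branch point and document 60 as a merge point, matching the description above.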
[0038]
Although only two of the first context 64 and the second context 66 are shown in the figure, in practice, many context recognitions are possible. There may be a case where the tendency of the aspect becomes clearer as the number of contexts to be recognized increases. The classification processing unit 34 in FIG. 2 may hold the classification criteria for such tendencies in advance.
[0039]
FIG. 4 is a diagram illustrating a state in which a plurality of documents arranged in time series are associated by a plurality of contexts. The vertical axis is the workspace axis, and the horizontal axis is the time axis. The workspace may be, for example, a space indicating the spread of each department or project in the company, or may be a space formed according to the classification of documents to be managed by this apparatus.
[0040]
The entire context space shown in the figure may be displayed on the screen of the user terminal 104. In that case, for example, when the user designates a specific viewpoint phrase, only the corresponding context may be highlighted. The user may select one of a plurality of contexts, and the display of the selected context may be switched to a tree-like thread form.
[0041]
FIG. 5 is a diagram illustrating an example of a screen that displays threads for each phrase in a tree shape. In the first thread 70, the title and creation date of each document included in the first context 64 of FIGS. 3 and 4 are displayed in a tree shape. Similarly, in the second thread 72, the title and creation date of each document included in the second context 66 are displayed in a tree shape. Thereby, a plurality of documents related to the phrase specified by the user can be visualized together with the chronological relationship.
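A text-only approximation of such a tree display is sketched below. The nesting style (each later document indented under the previous one, as in a reply chain), the function name `render_thread`, and the sample titles and dates are all assumptions; the actual layout of FIG. 5 is not reproduced here.

```python
def render_thread(phrase, entries):
    """Render one thread as indented text: title and creation date per document."""
    lines = [f"[{phrase}]"]
    for depth, (title, created) in enumerate(entries):
        lines.append(f"{'  ' * depth}+- {title} ({created})")
    return "\n".join(lines)


text = render_thread(
    "Company A",
    [
        ("Inquiry", "2002-04-01"),
        ("Quote", "2002-04-08"),
        ("Contract draft", "2002-05-02"),
    ],
)
print(text)
```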
[0042]
FIG. 6 is a flowchart showing the processing executed by the document management apparatus according to the embodiment. The plurality of documents stored in the storage unit 14 are rearranged into chronological order (S10), and the viewpoint phrases for finding a context between documents are determined (S12). Documents that contain a viewpoint phrase and are therefore link targets are extracted (S14), and links between phrases are added for each combination of documents that are adjacent in chronological order (S16). Contexts formed by chaining a plurality of links are recognized (S18), and the documents are classified according to the mode of the context (S20). This completes the preprocessing on which search is premised.
[0043]
A search condition related to the classification of the document is acquired from the user together with the search keyword (S22), and the search is executed based on the search condition (S24). The document extracted as the search result is presented on the screen of the user terminal 104 (S26).
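As an illustration only, steps S22 to S26 may be sketched as follows in Python. The function name search, its parameters, and the sample data are hypothetical. The sketch assumes that the classification of each document has already been computed by the preprocessing and is available as a simple mapping.

```python
def search(documents, classifications, keyword, classification):
    """S22 to S24: return IDs of documents that contain `keyword` and
    whose precomputed classification equals `classification`."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if keyword in text and classifications.get(doc_id) == classification
    ]


# Hypothetical sample data: document texts and precomputed classifications.
documents = {
    "d1": "budget plan",
    "d2": "budget review and hiring",
    "d3": "hiring schedule",
    "d4": "budget and hiring summary",
}
classifications = {"d1": "start", "d2": "branch", "d3": "passage", "d4": "end"}

# S26: present the result (here, simply print the matching IDs).
results = search(documents, classifications, "budget", "branch")
print(results)
```

Because the classification is computed in advance, the search itself reduces to a filter over two conditions.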
[0044]
The present invention has been described above based on an embodiment. The embodiment is illustrative, and those skilled in the art will understand that various modifications can be made to the combinations of its components and processes, and that such modifications also fall within the scope of the present invention. Such modifications are described below.
[0045]
In the embodiment, a link is set between documents containing the same phrase. In a modification, strict identity of the phrase need not be required: flexibility may be provided by normalizing the phrases in each document, or by using a synonym dictionary or a controlled vocabulary dictionary.
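As an illustration only, such relaxed phrase matching may be sketched as follows in Python. The dictionary SYNONYMS, its entries, and the function names normalize and same_phrase are hypothetical. Unicode NFKC normalization is used here as one possible normalization and is not specified by the embodiment.

```python
import unicodedata

# Hypothetical synonym dictionary mapping variants to a canonical phrase.
SYNONYMS = {
    "pc": "computer",
    "personal computer": "computer",
}


def normalize(phrase, synonyms=SYNONYMS):
    """Fold width and case, collapse whitespace, then apply the dictionary."""
    folded = unicodedata.normalize("NFKC", phrase).casefold()
    key = " ".join(folded.split())
    return synonyms.get(key, key)


def same_phrase(a, b):
    """Treat two phrases as identical if they normalize to the same form."""
    return normalize(a) == normalize(b)
```

With this sketch, same_phrase("ＰＣ", "Personal  Computer") is true, so documents using either spelling would be linked under the same viewpoint phrase.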
[0046]
In FIG. 6, the process of rearranging the documents in chronological order is located before S12. However, as a modification, the process may be performed between S12 and S14. Alternatively, it is possible to adopt a procedure in which the rearrangement process is executed between S14 and S16, and the documents are rearranged in time series after extraction.
[0047]
In the embodiment, the classification of the context is handled as one of the search conditions. In a modification, the search results may instead be prioritized by applying a weight according to the number of contexts flowing into each document, or a weight according to the relationship between authorities and hubs.
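As an illustration only, prioritizing results by the number of contexts flowing into a document may be sketched as follows in Python. The function name rank_by_inflow and the sample data are hypothetical. The sketch assumes that the weight is simply the count of incoming links over all contexts, and it does not cover the weighting based on authorities and hubs.

```python
from collections import Counter


def rank_by_inflow(result_ids, contexts):
    """Order search results so that documents with more incoming links
    come first; ties keep their original order because sorted() is stable."""
    inflow = Counter(dst for links in contexts.values() for _, dst in links)
    return sorted(result_ids, key=lambda doc_id: -inflow[doc_id])


# Hypothetical contexts: viewpoint phrase -> chain of (earlier, later) links.
contexts = {
    "budget": [("d1", "d2"), ("d2", "d4")],
    "hiring": [("d2", "d3"), ("d3", "d4")],
}
ranked = rank_by_inflow(["d1", "d2", "d3", "d4"], contexts)
print(ranked)
```

Here d4 receives links from both contexts and is therefore listed first, while d1 receives none and is listed last.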
[0048]
When data posted to an electronic bulletin board is input from the input unit 12, a thread different from the thread managed on the electronic bulletin board may be formed. Even when a series of posts that belongs to a single thread on the electronic bulletin board in fact splits in content partway through, the embodiment can automatically divide it into separate threads.
[0049]
[Effect of the invention]
According to the present invention, it is possible to improve the convenience of the user regarding the use of the accumulated information.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration of a document management system according to an embodiment.
FIG. 2 is a functional block diagram showing a basic configuration of a document management server according to the embodiment.
FIG. 3 is a diagram showing a context between documents linked by a phrase link included in each document.
FIG. 4 is a diagram illustrating a state in which a plurality of documents arranged in time series are associated by a plurality of contexts.
FIG. 5 is a diagram illustrating an example of a screen that displays a thread for each phrase in a tree shape.
FIG. 6 is a flowchart showing processing executed by the document management server according to the embodiment.
[Explanation of symbols]
10 document management server, 14 storage unit, 16 condition setting unit, 20 phrase determination unit, 24 extraction unit, 28 thread management unit, 36 output unit.

Claims

A document management method comprising:
a step in which a CPU determines a phrase serving as a viewpoint for finding relevance between documents;
a step in which the CPU extracts, from a storage device, a plurality of documents whose content includes the determined phrase;
a step in which the CPU associates, by links, combinations of documents adjacent in chronological order among the extracted documents, and recognizes the existence of a context between documents by chaining a plurality of the links; and
a step in which an output means presents a document for which the existence of the context is recognized,
wherein the recognizing step includes classifying the plurality of documents based on whether the aspect of the context recognized from each document corresponds to a start point, a merge point, a passage point, a branch point, or an end point in that document, and
wherein the presenting step includes acquiring a user designation regarding the aspect of the context, extracting from the storage device a document classified into the designated aspect, and presenting the document.

A document management apparatus comprising:
a storage unit that stores documents;
a condition setting unit that determines a phrase serving as a viewpoint for finding relevance between documents;
an extraction unit that extracts, from the storage unit, a plurality of documents whose content includes the determined phrase;
a thread management unit that associates, by links, combinations of documents adjacent in chronological order among the extracted documents, and recognizes the existence of a context between documents by chaining a plurality of the links; and
an output unit that presents a document for which the existence of the context is recognized,
wherein the thread management unit classifies the plurality of documents based on whether the aspect of the context recognized from each document corresponds to a start point, a merge point, a passage point, a branch point, or an end point in that document, and
wherein the output unit acquires a user designation regarding the aspect of the context, extracts from the storage unit a document classified into the designated aspect, and presents the document.

A computer program causing a computer to execute:
a step of determining a phrase serving as a viewpoint for finding relevance between documents;
a step of extracting a plurality of documents whose content includes the determined phrase;
a step of associating, by links, combinations of documents adjacent in chronological order among the extracted documents, and recognizing the existence of a context between documents by chaining a plurality of the links; and
a step of presenting a document for which the existence of the context is recognized,
wherein the recognizing step includes classifying the plurality of documents based on whether the aspect of the context recognized from each document corresponds to a start point, a merge point, a passage point, a branch point, or an end point in that document, and
wherein the presenting step includes acquiring a user designation regarding the aspect of the context, and extracting and presenting a document classified into the designated aspect.