JP3910901B2 - Document structure search method, document structure search apparatus, and document structure search program

Info

Publication number: JP3910901B2
Application number: JP2002285543A
Authority: JP
Inventors: 拓也金輪; 優鈴木; 庄三磯部; 光生布目; 顕司小野; 雅一服部
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-09-30
Filing date: 2002-09-30
Publication date: 2007-04-25
Anticipated expiration: 2022-09-30
Also published as: JP2004126640A

Description

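【０００１】
【Technical Field of the Invention】
The present invention relates to a document structure search method and a structured document search apparatus for retrieving a desired document structure from among the differing document structures of multiple structured documents stored in a hierarchically structured database.
【０００２】
【Prior Art】
Some databases that store structured documents, such as documents written in XML (Extensible Markup Language), have a hierarchical structure and store and manage each of multiple structured documents with differing document structures as a partial document on one large tree structure.
【０００３】
The document structure of a structured document, such as a document written in XML (hereinafter also called an XML document or XML data), is called a schema. A data creator can define a schema explicitly in a schema language such as XML-Schema or RelaxNG; a schema written in a schema language is called a schema document and is itself a structured document. In general, however, writing a schema in a schema language takes effort. Moreover, unlike SGML, XML does not require that a schema be defined for the data in a schema language; doing so is strictly optional.
【０００４】
Unlike hierarchical XML databases of the kind described above, there are also XML databases with a flat (non-hierarchical) data structure that store XML documents folder by folder. In that case, schemas are clearly stored per folder and are not managed hierarchically.
【０００５】
In an XML database that has a hierarchical structure and manages XML documents in one large tree structure (hereinafter called a hierarchical XML database), the document structure of each structured document is managed as part of the hierarchy.
【０００６】
A schema need not be defined in advance, but if every XML document stored in a hierarchical XML database had a different document structure, management would undeniably become difficult. For efficient management of stored documents as well, XML documents with similar content (belonging to the same category) should preferably share the same document structure rather than have entirely different ones.
【０００７】
If XML document authors each create XML documents with their own document structures, document structures proliferate, the hierarchical XML database can no longer be used as anything more than a simple search engine, and the advantages of XML may be lost.
【０００８】
In other words, in a hierarchical XML database it is desirable to limit schema proliferation as far as possible. One way to do so is, before creating an XML document, to search the hierarchical XML database for the schema of XML documents with similar content and to create the data using that schema.
【０００９】
Conventional hierarchical XML databases, however, have the following two problems.
【００１０】
The first problem is that a hierarchical XML database offers no effective means of searching for only the desired document structure. If schemas were defined in a schema language for all stored documents, managing them in the database would make schema search easy. But because defining a schema in a schema language is strictly optional, only the schemas that have actually been registered would be found.
【００１１】
Another conceivable approach is to search the data with XQuery or the like while specifying an ambiguous structure, and to have the user examine the resulting XML data and extract the structure by eye. This is also costly, because what is wanted is the schema, not the data.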
【００１２】
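For example, one technique lets the user specify an arbitrary element name as a search condition and returns, along with each matching document, the DTD (Document Type Definition) information that the document conforms to (see, for example, Patent Document 1). With this technique, however, no such information can be obtained for documents that do not conform to a DTD. Its granularity is also the whole document, so it cannot return only a component of a DTD (that is, only the schema of a partial document).
【００１３】
Another technique retrieves and displays documents similar to a seed document (see, for example, Patent Document 2). With the technique disclosed there, determining which schema a retrieved document follows must be done by eye, and the search results are again whole documents, so problems remain, such as low similarity when extracting a schema contained in a partial document.
【００１４】
A further technique manages newly registered element names and attribute names in a table (see, for example, Patent Document 3). It is effective only for databases that manage XML documents flatly, however, and gives no consideration to hierarchical XML databases in which document structures are managed hierarchically.
【００１５】
The second problem is that, when a user sets out to search for a document structure, the search target is in most cases ambiguous and vague: how the search should be carried out, what kind of result would be satisfactory, and so on. Yet there has been no way of specifying document structure search conditions that allow for ambiguity so that such cases can be handled.
【００１６】
As to how to search: when searching for a schema for "本" (book) information, for example, one would normally search for schema elements that define an element named "本". In some cases, however, the user also wants schemas that do not define exactly "本" but instead define "書物" or "Book".
【００１７】
There is also a demand to search, from nothing more than vague keywords, for a schema that would organize them. In this case there is no indication at all of whether a keyword denotes an element, an attribute, or a value; the intent is simply to find any schema that could hold the loosely defined data.
【００１８】
Thus, at the time of searching for a schema, what the user has in mind is vague rather than fixed, and the mechanism should be able to absorb this vagueness.
【００１９】
With a conventional search technique (see, for example, Patent Document 2), nothing is output as a search result unless a tag name matches exactly. If a DTD is specified in advance, the technique shows the structure that follows it, but that is beside the point of the present invention, which is to search for schemas.
【００２０】
As to what results to return, the system should not simply return a schema written in a schema language; it should also attach information the user wants to use.
【００２１】
Even without defining a schema in a schema language, users who manage data as XML documents often assume a schema implicitly and create documents according to it. There is also a strong tendency to place data with similar schemas side by side under the same path in the hierarchical database. These can be regarded as implicit schema definitions, and a way of managing them in the same manner as explicit schema definitions written in a schema language is needed.
【００２２】
Uses of schema search (document structure search) differ according to the user's intent, but they fall mainly into two types.
【００２３】
The first is when the user can predict to some extent which schema to look for. For example, when searching for "patent-related schemas", the schemas retrieved are those similar to a schema that defines XML documents of "特許" (patent) information.
【００２４】
The second is when data exists but even the viewpoint of how to structure and organize it has not yet been decided, and the user wants nearby schemas to be found. The main concern here is how to present the user with useful schemas related to the data to be stored.
【００２５】
The search results, too, presuppose a hierarchical database in which schemas are also arranged hierarchically.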
【００２６】
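【Patent Document 1】
Japanese Unexamined Patent Application Publication No. 2000-200286
【００２７】
【Patent Document 2】
Japanese Unexamined Patent Application Publication No. 2001-14326
【００２８】
【Patent Document 3】
Japanese Unexamined Patent Application Publication No. 2002-7439
【００２９】
【Problem to Be Solved by the Invention】
As described above, there has conventionally been no technique for efficiently searching a hierarchically structured XML database for document structures alone while taking the ambiguity of the search conditions into account.
【００３０】
An object of the present invention is therefore to provide a document structure search method, a document structure search apparatus, and a program that can efficiently search a hierarchically structured XML database for document structures on the basis of vague search conditions.
【００３１】
【Means for Solving the Problem】
The present invention is for retrieving a desired document structure from among the differing document structures of multiple structured documents stored in a hierarchically structured database. The hierarchical structure is made up of the components that constitute each of the differing document structures. When at least one keyword is input as a search condition for a document structure, similar words of the keyword are obtained, the hierarchical structure is searched for components whose element name or element value includes the keyword or any of the similar words, and a document structure including the retrieved components is output as the search result.
【００３２】
According to the present invention, simply by inputting at least one keyword, a document structure that includes a component whose element name or element value contains the keyword or a word within its similarity range can be obtained efficiently.
【００３３】
The present invention is also for retrieving a desired document structure from among the differing document structures of multiple structured documents stored in a hierarchically structured database, the hierarchical structure being made up of the components that constitute each of the differing document structures. When at least one first keyword and at least one second keyword are input as search conditions for a document structure, similar words of the first keyword are obtained and the hierarchical structure is searched for first components whose element name or element value includes the first keyword or any of its similar words. Similar words of the second keyword are likewise obtained and the hierarchical structure is searched for second components whose element name includes the second keyword or any of its similar words. At least, among the document structures headed by a second component, those that include a first component are output as the search result.
【００３４】
According to the present invention, simply by inputting at least one first keyword and at least one second keyword, a document structure can be obtained efficiently that includes a component whose element name or element value contains the first keyword or a word within its similarity range, and that is headed by a component whose element name contains the second keyword or a word within its similarity range.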
【００３５】
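【Embodiments of the Invention】
Before describing embodiments of the present invention, the basic configuration and processing operations of a structured document management system are described.
【００３６】
(Description of the structured document management system)
Examples of structured documents include documents written in XML or SGML. SGML (Standard Generalized Markup Language) is a standard defined by ISO (the International Organization for Standardization). XML (eXtensible Markup Language) is a standard defined by the W3C (World Wide Web Consortium). Both are structured document standards that make it possible to give documents structure.
【００３７】
The following description uses documents written in XML as the example of structured documents. Data that defines the document structure of a structured document (document structure definition data) is called a schema. For XML, schema languages such as XML-Schema and XDR (XML-Data Reduced) have been proposed for defining schemas. Here, the case of writing schemas in XDR is used as an example.
【００３８】
A schema is also a structured document managed by the structured document management system and is sometimes called a schema document here. Structured documents other than schema documents, with all sorts of content such as patent specifications, e-mail, weekly reports, and advertisements, are sometimes called content documents here.
【００３９】
The structured document management system manages schema documents, content documents, and also queries that describe search requests from users as described later, that is, query documents. These are collectively called "documents".
【００４０】
Hereinafter, unless otherwise noted, "document" refers to content documents, schema documents, and query documents alike.
【００４１】
Before describing the embodiments, XML is briefly explained.
【００４２】
FIG. 3 shows "特許" (patent) information as an example of a structured document written in XML. XML and SGML use tags to express document structure. There are start tags and end tags, and each component of the document structure is enclosed between a start tag and an end tag. A start tag is the element name enclosed between the symbols "<" and ">", and an end tag is the element name enclosed between the symbols "</" and ">". The content of the component that follows a tag is text (a character string) or a repetition of child components. Attribute information can be set in a start tag, as in "<element-name attribute="attribute-value">". A component that contains no text, such as "<特許DB></特許DB>", can also be written in the abbreviated form "<特許DB/>".
【００４３】
The document shown in FIG. 3 has the element beginning with the "特許" tag as its root, with elements beginning with the "タイトル" (title), "出願日" (filing date), "出願者" (applicant), and "要約" (abstract) tags as its children. The element beginning with the "タイトル" tag, for example, has a single piece of text (character string), "XMLデータベース", as its element value.
【００４４】
Structured documents such as XML typically contain arbitrary components repeatedly, and their document structure is usually not fixed in advance.
【００４５】
To represent a structured document such as that of FIG. 3 logically, a tree representation as shown in FIG. 4 is used. The tree consists of nodes (numbered and drawn as circles), arcs (labeled lines connecting the node circles), and text enclosed in rectangles.
【００４６】
One node corresponds to one component, that is, one document object. Multiple arcs labeled with tag names or attribute names emerge from a node. Each arc leads to a character string (text) serving as a node value or element value. The alphanumeric strings written in the nodes (for example, "#0" and "#49") are object IDs that identify the document objects.
【００４７】
The tree structure shown in FIG. 4 is called the document object tree of the structured document shown in FIG. 3.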
【００４８】
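FIG. 1 shows a configuration example of the structured document management system according to this embodiment. In FIG. 1, the structured document management system broadly comprises a request control unit 1, an access request processing unit 2, a search request processing unit 3, a data access unit 4, a document storage unit 5, and an index storage unit 6. The document storage unit 5 and the index storage unit 6 are implemented, for example, on external storage devices.
【００４９】
The system configuration of FIG. 1 can be implemented in software.
【００５０】
The request control unit 1 comprises a request receiving unit 11 and a result processing unit 12. The request receiving unit 11 accepts requests from users, such as storing, retrieving, and searching for documents, and invokes the access request processing unit 2. The result processing unit 12 returns the results processed by the access request processing unit 2 to the requesting user.
【００５１】
The access request processing unit 2 comprises multiple processing units corresponding to the various user requests, such as storing, retrieving, and deleting documents: a document storing unit 21, a document acquisition unit 22, and a document deletion unit 23.
【００５２】
The document storing unit 21 stores a document in a specified logical area in the document storage unit 5.
【００５３】
When a logical area in the document storage unit 5 is specified, the document acquisition unit 22 retrieves the document located in that area.
【００５４】
The document deletion unit 23 deletes the document located in a specified logical area in the document storage unit 5.
【００５５】
The document storage unit 5 is a structured document database and stores documents hierarchically in a tree structure, like a UNIX directory structure, as shown for example in FIG. 8.
【００５６】
As FIG. 8 shows, the structured document database can be represented in the same way as the tree structure of a single structured document such as that of FIG. 4. That is, the partial hierarchy (subtree) below any node is a structured document cut out of the structured document database, and is called a document object tree here. Each node is assigned an object ID, a numeric value that is unique within the structured document database.
【００５７】
The root node of the hierarchy is assigned the object ID "#0", which identifies it as the root node.
【００５８】
From the root node, that is, node "#0", there is a link to the node with object ID "#1", headed by the "root" tag. From node "#1" there is a link to the node with object ID "#2", headed by the "特許DB" tag. From node "#2" there are links to the nodes with object IDs "#42", "#52", and "#62", each headed by a "特許" tag.
【００５９】
The "特許" information shown in FIG. 3 corresponds to the subtree below node "#42" in FIG. 8. From this node there are links to nodes headed by the "タイトル", "出願者", and "要約" tags and so on, and from the terminal nodes there are links to character strings (element values) such as "XMLデータベース", "T社", and "XMLを統一的に管理するデータベースを提供する…".
【００６０】
In FIG. 8, the subtree below the node with object ID "#52" and the subtree below the node with object ID "#62" are likewise document object trees, each corresponding to one piece of "特許" information.
【００６１】
Note that the element value "XMLデータベース" linked to node "#43", for example, is connected to node "#43" by the special tag name "#value". Because this tag name begins with "#", it cannot be used as a standard tag name under the XML specification.
【００６２】
A structured document path is used to specify a particular node in such a structured document database. A structured document path is a character string beginning with "uix://root", where uix (Universal Identifier for XML) is the string indicating that this is a structured document path.
【００６３】
For example, for the structured document path "uix://root/特許DB", the logical area in the document storage unit 5 that the path indicates is, in FIG. 8, the node pointed to by the arc labeled "特許DB" from node "#1", namely node "#2".
【００６４】
Similarly, the structured document path "uix://root/特許DB/特許" points to node "#42" in FIG. 8, and the structured document path "uix://root/特許DB/特許/出願日/年" points to node "#45" in FIG. 8.
【００６５】
When, for example, multiple pieces of "特許" information are stored below node "#2" in FIG. 8, that is, in the component "特許DB", an index may be appended to the element name (here, "特許") to distinguish the individual pieces.
【００６６】
The first piece of "特許" information in "特許DB" is "uix://root/特許DB/特許[0]", which is regarded as identical to "uix://root/特許DB/特許". The second piece is "uix://root/特許DB/特許[1]", and the fifth is "uix://root/特許DB/特許[4]".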
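The path notation above can be illustrated with a short Python sketch. It is not taken from the specification: the function name `parse_uix_path` and the regular expression are illustrative assumptions, and the only rules it encodes are the ones stated in the text (paths start with `uix://root`, and an omitted index means `[0]`).

```python
import re

PREFIX = "uix://"
# One path step: an element name optionally followed by [index].
_SEGMENT = re.compile(r"^(?P<name>[^\[\]/]+)(?:\[(?P<index>\d+)\])?$")

def parse_uix_path(path):
    """Split a structured document path into (element name, index) steps."""
    if not path.startswith(PREFIX + "root"):
        raise ValueError("a structured document path starts with uix://root")
    steps = []
    for segment in path[len(PREFIX):].split("/"):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"malformed path segment: {segment!r}")
        # An omitted index means the first occurrence, i.e. [0].
        steps.append((match.group("name"), int(match.group("index") or 0)))
    return steps

print(parse_uix_path("uix://root/特許DB/特許[4]"))
```

Under this reading, `uix://root/特許DB/特許` and `uix://root/特許DB/特許[0]` parse to the same list of steps.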
【００６７】
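The index storage unit 6 stores the element name occurrence index and the data occurrence index used during searches.
【００６８】
The element name occurrence index is an index file that associates each element name stored in the structured document database with the positions of the structured documents (document object trees) headed by a component with that element name. For example, in the structured document database of FIG. 8, when the element name "特許" (corresponding to "特許" information) appears in the structured documents below nodes "#42", "#52", and "#62", the element name occurrence index stores, linked to the element name "特許", the parent of nodes "#42", "#52", and "#62", namely node "#2", as shown in FIG. 9.
【００６９】
Indexing by parent node in this way allows the index file to be compressed. That is, when indexing is by parent node, however many child nodes are added, the parent stands in for them, so the number of nodes linked to the element name does not grow.
【００７０】
The data occurrence index is an index file that associates each character string stored in the structured document database with the positions of the structured documents (document object trees) in which that string appears. For example, in the structured document database of FIG. 8, the string "XML" appears in the structured documents below nodes "#43" and "#49". In this case the data occurrence index stores nodes "#43" and "#49" linked to the string "XML", as shown in FIG. 10.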
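As a rough illustration of the two indexes, the Python sketch below builds them from a flat node table. The table layout, the function name `build_indexes`, and the choice to index only a given list of terms are assumptions made for brevity; the specification does not say how strings are tokenized.

```python
from collections import defaultdict

# Hypothetical flat encoding of part of the FIG. 8 tree:
# node id -> (element name, parent id, text value or None).
NODES = {
    1: ("root", 0, None),
    2: ("特許DB", 1, None),
    42: ("特許", 2, None),
    43: ("タイトル", 42, "XMLデータベース"),
    49: ("要約", 42, "XMLを統一的に管理するデータベースを提供する"),
    52: ("特許", 2, None),
    62: ("特許", 2, None),
}

def build_indexes(nodes, terms):
    element_index = defaultdict(set)   # element name -> parent node ids
    data_index = defaultdict(set)      # term -> ids of nodes whose text contains it
    for node_id, (name, parent_id, text) in nodes.items():
        # Index by parent: many sibling nodes collapse into one entry.
        element_index[name].add(parent_id)
        for term in terms:
            if text is not None and term in text:
                data_index[term].add(node_id)
    return element_index, data_index

element_index, data_index = build_indexes(NODES, ["XML"])
print(sorted(element_index["特許"]))  # [2]
print(sorted(data_index["XML"]))      # [43, 49]
```

The three "特許" nodes contribute a single entry, node 2, which is the compression effect that indexing by parent node is meant to achieve.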
【００７１】
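A specified logical area in the document storage unit 5 is the storage location of a document that the user has specified with a structured document path. A structured document path is a representation that users can recognize.
【００７２】
Returning to the description of FIG. 1.
【００７３】
The data access unit 4 performs the various processes for accessing the document storage unit 5. It comprises a document object tree storing unit 41, a document object tree deletion unit 42, a document object tree acquisition unit 43, a document string acquisition unit 44, a document parser unit 46, a composite document creation unit 47, and an index update unit 48.
【００７４】
The document object tree storing unit 41 stores a document object tree in a specified physical area in the document storage unit 5.
【００７５】
The document object tree deletion unit 42 deletes the document object tree located in a specified physical area in the document storage unit 5.
【００７６】
The document object tree acquisition unit 43 retrieves the document object tree located in a physical area in the document storage unit 5 that has been specified (for example, by a structured document path).
【００７７】
The document string acquisition unit 44 converts a document object tree into a structured document (XML document).
【００７８】
The document parser unit 46 reads a structured document input by the user and checks its document structure. If a schema, that is, definition data for the document structure, exists, it also verifies whether the document structure of the input structured document conforms to the schema. The output is a document object tree. A document parser can usually be built by combining a lexical analyzer such as lex (lexical analyzer generator), which performs lexical analysis and splits the input into tokens, with a parser generator such as yacc (yet another compiler compiler).
【００７９】
When documents are stored or deleted, it must be checked whether the result conforms to the schema; the composite document creation unit 47 creates the data needed for this check.
【００８０】
The index update unit 48 updates the element name occurrence index and the data occurrence index shown in FIGS. 9 and 10 each time the contents of the structured document database are updated by storing or deleting a document.
【００８１】
A physical area in the document storage unit 5 is internal data, such as a file offset or object ID, that points to the unique location of document data within the structured document database. It is data that users cannot recognize.
【００８２】
The search request processing unit 3 uses the processing units of the data access unit 4 to search the documents stored in the document storage unit 5. When the request receiving unit 11 of the request control unit 1 accepts a document search request from a user, a query document written in a query language is input from the request receiving unit 11 to the search request processing unit 3. The search request processing unit 3 then accesses the index storage unit 6 and the document storage unit 5 through the data access unit 4, obtains the set of documents matching the search request, and outputs the result via the result processing unit 12.
【００８３】
FIG. 2 shows one usage form of the structured document management system of FIG. 1, in which the structured document management system 100 configured as in FIG. 1 operates as a back end of the WWW (World Wide Web).
【００８４】
A WWW browser 103 runs on each of multiple (here, for example, three) client terminals 102 (for example, personal computers or mobile communication terminals). A user accesses the structured document management system 100 by accessing a WWW server 101 from a client terminal. The WWW browser 103 and the WWW server 101 communicate over HTTP (HyperText Transfer Protocol). The WWW server 101 and the structured document management system 100 communicate via CGI (Common Gateway Interface), COM (Component Object Model), or the like.
【００８５】
Requests from users, such as storing, retrieving, and searching for documents, are sent from the WWW browser 103 and accepted by the structured document management system 100 via the WWW server 101. The results processed by the structured document management system 100 are returned via the WWW server 101 to the requesting WWW browser 103.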
【００８６】
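The (1) storage function and (2) search function of the structured document management system of FIG. 1 are described in detail below.
【００８７】
(Storage function)
The storage-related commands of the structured document management system of FIG. 1 are as follows.
【００８８】
insertXML(path, N, XML): store a document
appendXML(path, XML): store a document
getXML(path): retrieve a document
removeXML(path): delete a document
setSchema(path, schema): store a schema
getSchema(path): retrieve a schema
"insertXML" is a command that inserts a document at the N-th position below the structured document path specified in the parentheses (hereinafter simply called the insert command).
【００８９】
"appendXML" is a command that inserts a document at the end below the structured document path specified in the parentheses (hereinafter simply called the append command).
【００９０】
"getXML" is a command that retrieves the document below the structured document path specified in the parentheses (hereinafter simply called the get command).
【００９１】
"removeXML" is a command that deletes the document below the structured document path specified in the parentheses (a document other than a schema document, mainly a content document) (hereinafter simply called the delete command).
【００９２】
"setSchema" is a command that sets a schema at the structured document path specified in the parentheses (hereinafter simply called the schema store command).
【００９３】
"getSchema" is a command that retrieves the schema set at the structured document path specified in the parentheses (hereinafter simply called the schema get command).
【００９４】
Of these commands, the insert, append, and schema store commands are executed by the document storing unit 21 of the access request processing unit 2; the get and schema get commands are executed by the document acquisition unit 22; and the delete command is executed by the document deletion unit 23.
【００９５】
With reference to FIG. 5, execution of the append command on the structured document database in its initial state (see FIG. 5(a)) is described.
【００９６】
As shown in FIG. 5(a), in the initial state node "#0" and node "#1" are connected by the "root" arc. Executing "appendXML("uix://root", "<特許DB/>")" on this state creates node "#2" and the "特許DB" arc, as shown in FIG. 5(b).
【００９７】
Execution of the get command on the structured document database in the state of FIG. 5(b) is described next.
【００９８】
For example, when "getXML("uix://root")" is executed, the document object tree below node "#0" indicated by the "root" arc in FIG. 5(b) is retrieved and converted into an XML document. As a result, the character string "<root><特許DB/></root>" is obtained as an XML document like that shown in FIG. 6. The get command is processed by the document acquisition unit 22 of the access request processing unit 2.
【００９９】
Next, consider executing, on the structured document database in the state of FIG. 5(b), an append command to store "特許" information as a content document (XML document) like that shown in FIG. 3. In this case "appendXML("uix://root/特許DB", "<特許>…</特許>")" is executed, where ""<特許>…</特許>"" corresponds to the XML document of the "特許" information shown in FIG. 3.
【０１００】
When this append command is executed, a document object tree (corresponding to FIG. 4) topped by node "#42" is added below node "#2", as shown in FIG. 7.
【０１０１】
Suppose the following append command is executed three times in succession on the structured document database in the state of FIG. 5(b).
【０１０２】
"appendXML("uix://root/特許DB", "<特許>…</特許>")"
In this command, "<特許>…</特許>" corresponds to a content document with the same document structure as the XML document shown in FIG. 3.
【０１０３】
Then, as shown in FIG. 8, document object trees topped by nodes "#42", "#52", and "#62" are added below node "#2".
【０１０４】
Next, consider executing a get command on the structured document database in the state of FIG. 8 to retrieve the three pieces of "特許" information. In this case "getXML("uix://root/特許DB")" is executed. The document object tree below node "#2", indicated by the "特許DB" arc, is then retrieved. As a result, the XML document "<特許DB><特許>…</特許><特許>…</特許><特許>…</特許></特許DB>" is obtained, as shown in FIG. 11.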
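The append and get commands walked through above can be mimicked with a toy in-memory store. The sketch below is only an illustration built on Python's standard `xml.etree.ElementTree`: the class name `StructuredDocumentDB` is hypothetical, path indexes such as `[1]` are not handled, and object IDs, schemas, and indexes are left out entirely.

```python
import xml.etree.ElementTree as ET

class StructuredDocumentDB:
    """Toy stand-in for the database: one XML tree rooted at <root>."""

    def __init__(self):
        self.root = ET.Element("root")

    def _resolve(self, path):
        names = path[len("uix://"):].split("/")
        if names[0] != "root":
            raise ValueError("path must start with uix://root")
        node = self.root
        for name in names[1:]:
            # Take the first child with this tag, i.e. index [0].
            node = next((child for child in node if child.tag == name), None)
            if node is None:
                raise KeyError(path)
        return node

    def appendXML(self, path, xml):
        self._resolve(path).append(ET.fromstring(xml))

    def getXML(self, path):
        return ET.tostring(self._resolve(path), encoding="unicode")

db = StructuredDocumentDB()
db.appendXML("uix://root", "<特許DB/>")
for title in ("A", "B", "C"):
    db.appendXML("uix://root/特許DB", f"<特許><タイトル>{title}</タイトル></特許>")
print(db.getXML("uix://root/特許DB"))
```

After three appends, `getXML` on the "特許DB" path returns one XML string containing all three "特許" subtrees, mirroring the result described for FIG. 11.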
【０１０５】
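The structured document database also manages the data that defines the document structure of content documents (XML documents) such as the "特許" information above, that is, schemas.
【０１０６】
FIG. 12 shows an example of a schema that defines the document structure of an XML document. XDR (XML-Data Reduced), one of the document structure definition languages for XML, is used here. Other document structure definition languages such as XML-Schema may of course be used instead.
【０１０７】
The schema of FIG. 12 defines the document structure of the "特許" information of FIG. 3 in XDR. As is readily seen from FIG. 12, a schema is also a structured document in XML format. It starts with the component beginning with the "Schema" tag, whose children are a set of elements beginning with "ElementType" tags.
【０１０８】
Consider executing a schema store command on the structured document database in the state of FIG. 8 to store the schema document of FIG. 12. In this case "setSchema("uix://root/特許DB", "<Schema>…</Schema>")" is executed, where ""<Schema>…</Schema>"" corresponds to the schema document of FIG. 12.
【０１０９】
Executing this command adds a "#schema" arc below node "#2", as shown in FIG. 13, and beyond it a document object tree topped by node "#3". Because the schema itself is expressed as an XML document, it is expanded into a tree as shown in FIG. 13, just as when storing a content document such as the "特許" information described above.
【０１１０】
In FIG. 13, arcs beginning with "@", such as "@name", correspond to attributes. Because names such as "#schema" and "@name" begin with "#" or "@", they cannot be used as standard tag names under the XML specification.
【０１１１】
Now that the schema document of FIG. 12 has been stored below node "#2", documents subsequently stored below node "#2" may be required to conform to the document structure defined by that schema document. In other words, the schema of FIG. 12 is now set below node "#2".
【０１１２】
When the schema of FIG. 12 is set below node "#2", an attribute value indicating that a schema exists is set in each node (in its file) of the document object tree below node "#2", as shown for example in FIG. 14.
【０１１３】
After the schema of FIG. 12 has been set below node "#2", when a document of "特許" information such as that of FIG. 3, which matches the document structure defined by the schema, is stored in the structured document database as a document object tree as shown in FIG. 14, an attribute value indicating that the schema of FIG. 12 exists for the document's structure is set in each document object making up that tree. For example, in the file of each document object making up the tree, the attribute indicating that a schema exists (for example, "スキーマ適合有無", schema conformance) is set to "1". In FIG. 14, the document objects (nodes) that conform to the schema are drawn as double circles. For each document object drawn as a double circle, a document structure definition corresponding to that object exists.
【０１１４】
FIG. 15 conceptually shows the contents of the file of each document object. For example, the file of the document object with object ID "#42" records the attribute value above together with information about the other document objects linked to it (for example, arcs and pointer values to the linked document objects). When no schema applies to a document object, the value of "スキーマ適合有無" is "0".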
【０１１５】
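FIGS. 16 and 17 show examples in which a concept hierarchy is expressed as a structured document. A concept hierarchy is the result of hierarchically classifying, by meaning, the words that the structured document management system of FIG. 1 uses as needed, for example as keywords in search conditions. The "概念" (concept) information shown in FIGS. 16 and 17 is a content document written in XML.
【０１１６】
The example of "概念" information in FIG. 16 expresses as a concept hierarchy an "情報モデル" (information model), used as one classification axis for classifying the content of patent documents in a so-called patent search. The "概念" information enclosed in "概念" tags has a nested document structure. In the example of FIG. 16, the concept "情報モデル" has the child concepts "ドキュメント", "リレーション", and "オブジェクト". The concept "ドキュメント" has the child concepts "構造化ドキュメント" and "非構造化ドキュメント". The concept "構造化ドキュメント" in turn has the child concepts "XML" and "SGML".
【０１１７】
The example of "概念" information in FIG. 17 expresses a different classification axis from FIG. 16, "情報操作" (information operation), as a concept hierarchy. In the example of FIG. 17, the concept "情報操作" has the child concepts "検索", "格納", "加工", and "流通".
【０１１８】
Like the "特許" information described earlier, "概念" information such as that of FIGS. 16 and 17 can be stored in the structured document database. For example, "appendXML("uix://root", "<概念DB/>")" is first executed on the structured document database in the state of FIG. 8, creating node "#201" and the "概念DB" arc as shown in FIG. 18. To store the "概念" information of FIG. 16 in this state, "appendXML("uix://root/概念DB", "<概念名前>…</概念>")" is executed, where ""<概念名前>…</概念>"" corresponds to the "概念" information of FIG. 16.
【０１１９】
When this append command is executed, a document object tree topped by node "#202" is added below node "#201", as shown in FIG. 19.
【０１２０】
As described above, the structured document management system of FIG. 1 treats the enormous number of XML documents with differing document structures registered in the structured document database (content documents, schema documents, query documents, and so on) as a single huge tree-shaped XML document headed by the "root" tag, as shown in FIGS. 18 and 19. Partial XML documents are therefore accessed through paths into the huge XML document, a uniform means of access that does not depend on document structure, which makes it possible to search and process XML documents broadly.
【０１２１】
In addition, by setting a schema on part of the structured document database, the system may automatically check validity, that is, whether the document structure of a document about to be stored matches the document structure defined by that schema.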
【０１２２】
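(Search function)
The search-related commands of the structured document management system of FIG. 1 are as follows.
【０１２３】
query(ql)
"query" is a command that executes the query ql given as the parameter in the parentheses and obtains the resulting XML document (hereinafter called the search command).
【０１２４】
A query is an XML document with a document structure that describes the search location, the search conditions, the parts of the information to extract, and so on in a language similar in form to SQL (Structured Query Language), as shown for example in FIG. 20. Query documents are also managed by the structured document management system.
【０１２５】
The element beginning with the "kf:from" tag specifies the search location and associates variables with the values of document elements; the element beginning with the "kf:where" tag describes conditions on the variables; and the element beginning with the "kf:select" tag describes the output format of the search results.
【０１２６】
There are two kinds of search: simple search and concept search. A simple search retrieves and extracts information that satisfies the search conditions specified in the query. A concept search uses concept information specified in the query to retrieve and extract information that satisfies the search conditions specified in the query.
【０１２７】
FIG. 21 shows an example of a simple search query. For the structured document database in a state such as that of FIG. 14, the query of FIG. 21 describes the search request "among the documents of "特許" information stored below the node indicated by the "特許DB" arc, list the "タイトル" of the documents ("特許" information) that are from 1999 and have a "要約" element whose content is like "PC"".
【０１２８】
The description in the element beginning with the "kf:from" tag assigns the values of the "タイトル", "年" (year), and "要約" document elements of the "特許" information to the variables "$t", "$y", and "$s", respectively.
【０１２９】
The description in the element beginning with the "kf:where" tag performs the comparison "$y" = "1999". The component "MyLike" is a function that takes the variable "$s" and "PC" as arguments and detects values of "$s" that are similar to "PC".
【０１３０】
The description in the element beginning with the "kf:select" tag uses the variable "$t" as the output value.
【０１３１】
The "kf:star" tag is an ambiguous expression of structure. For example, "<特許><kf:star><年>" means "an element whose tag name is "年" and that exists somewhere among the descendants of an element whose tag name is "特許"".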
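The descendant-at-any-depth meaning of "kf:star" can be approximated with the standard library's `Element.iter`. The sketch below is an analogy rather than the query engine of the specification; the sample document and the function name `star_match` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the 年 element sits at different depths.
DOC = ET.fromstring(
    "<特許DB>"
    "<特許><タイトル>XMLデータベース</タイトル><出願日><年>1999</年></出願日></特許>"
    "<特許><タイトル>文書検索</タイトル><年>2001</年></特許>"
    "</特許DB>"
)

def star_match(root, ancestor_tag, target_tag):
    """Yield target_tag elements at any depth below an ancestor_tag element."""
    for ancestor in root.iter(ancestor_tag):
        for descendant in ancestor.iter(target_tag):
            if descendant is not ancestor:
                yield descendant

print([e.text for e in star_match(DOC, "特許", "年")])  # ['1999', '2001']
```

Both "年" elements match even though one is a grandchild and the other a direct child of "特許", which is the flexibility the ambiguous structure expression is meant to provide.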
【０１３２】
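FIG. 22 shows the search result obtained with the simple search query of FIG. 21. This search result is also an XML document.
【０１３３】
FIG. 23 shows an example of a concept search query. For the structured document database in a state such as that of FIGS. 18 and 19, the query of FIG. 23 describes a search request that searches the documents of "特許" information stored below the node indicated by the "特許DB" arc using the "概念" information stored below the node indicated by the "概念DB" arc. Here it is assumed that the values of the child elements of the tag whose value is the concept "周辺装置" (peripheral device) include the concepts "SCSI", "メモリ", and "HDD". Although not shown in FIG. 18, it is also assumed that each piece of "特許" information includes an element beginning with a "キーワード" (keyword) tag.
【０１３４】
That is, the query of FIG. 23 describes the search request "list the "タイトル" of the "特許" information whose "キーワード" component has as its element value any of the concepts under the concept "周辺装置"".
【０１３５】
The description in the element beginning with the "kf:from" tag assigns the values of the "タイトル" and "キーワード" elements of the "特許" information to the variables "$t" and "$k", respectively. The variable "$x" is assigned, as "概念" information, the values of the child elements of the tag whose value is "周辺装置" ("SCSI", "メモリ", "HDD", and so on).
【０１３６】
The description in the element beginning with the "kf:where" tag performs the comparison "$k" = "周辺装置" or "$k" = "$x".
【０１３７】
Next, the document search operation of the structured document management system of FIG. 1 is described with reference to the flowchart of FIG. 24.
【０１３８】
A display device of the client terminal shows a screen serving as a user interface, for example as shown in FIG. 25, provided by the structured document management system 100 (for example, by its request control unit 1).
【０１３９】
When the user selects "XML検索Win" on the screen of FIG. 25 with a pointing device such as a mouse, a screen serving as a user interface for document search, as shown in FIG. 26, is displayed.
【０１４０】
On the search screen of FIG. 26, area W1 shows the element names (tag names) of the components of the current tree structure of the structured document database in a simplified form that the user can understand. Although FIG. 26 shows only the element names of the upper levels, element names down to the leaves can be displayed.
【０１４１】
Area W11 is for entering the scope of the search (the search range on the tree structure), the search conditions, and so on. Area W12 displays the search results.
【０１４２】
For example, for the search request "among the documents below "uix://root" that have "特許" as their top tag, find the documents whose component with the "タイトル" tag has an element value containing the string "文書" and that were created in or after "1998"", the user selects "root" in area W1 with the mouse or the like to enter the structured document path as the search scope. In area W11 the user first enters "特許" as the top node (this may be entered by selecting "特許" in area W1 with the mouse or the like). The search conditions "the element value of the component "タイトル" contains the string "文書"" and "the element value of the component "年" is "1998" or greater" are entered in the data entry fields provided in advance.
【０１４３】
When the user then selects the "検索" (search) button B21, a query such as that shown in FIG. 27 is sent to the structured document management system together with an append command for storing the query in the structured document database. The storage location of queries is predetermined, and the system sets the parameters of this append command automatically. For example, when the structured document database is in the state of FIG. 18, the structured document path that serves as the parameter indicating the query's storage location is "uix://root/クエリDB". The other parameter of the append command is the query document itself.
【０１４４】
On receiving the query (step S100), the request receiving unit 11 passes it to the search request processing unit 3. It also passes the parameters of the append command for storing the query document to the document storing unit 21. The document storing unit 21 processes the append command, and the query is stored in the document storage unit 5 (step S101).
【０１４５】
Meanwhile, based on the received query, the search request processing unit 3 accesses the index storage unit 6 and the document storage unit 5 through the data access unit 4, obtains the set of documents matching the search request, extracts the information requested in the query, and outputs it via the result processing unit 12.
【０１４６】
For the query above, for example, it is efficient for narrowing down the search targets to begin by finding what matches the condition "the element value of the component with the "タイトル" tag contains the string "文書"". The data occurrence index as shown in FIG. 10 is therefore used to obtain the object IDs of the nodes (document objects) linked to the string "文書". For each of them, the document object tree is traced one step upstream; if this reaches the tag name "タイトル", the trace continues further upstream, and if it reaches the tag name "特許", the document object tree Ot11 below that node is extracted.
【０１４７】
Next, from the extracted document object trees Ot11, the document object trees Ot12 in which the element value of the component "年" is "1998" or greater are extracted.
【０１４８】
These document object trees Ot12 are the documents that match the query. In accordance with what the query requests, the structured document path to the top node of each document object tree Ot12 is then obtained (step S102).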
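The index-driven narrowing described in the preceding paragraphs might look roughly like the following Python sketch. The node table, the pre-built `DATA_INDEX`, and the function name `find_patents` are all hypothetical, and the sketch assumes for simplicity that "年" is a direct child of "特許".

```python
# Hypothetical node table: id -> (tag, parent id, text). Ids loosely follow FIG. 8.
NODES = {
    2: ("特許DB", 1, None),
    42: ("特許", 2, None), 43: ("タイトル", 42, "文書管理"), 45: ("年", 42, "1999"),
    52: ("特許", 2, None), 53: ("タイトル", 52, "文書検索"), 55: ("年", 52, "1997"),
    62: ("特許", 2, None), 63: ("要約", 62, "文書を扱う"), 65: ("年", 62, "2000"),
}
DATA_INDEX = {"文書": [43, 53, 63]}  # data occurrence index for one string

def find_patents(term, min_year):
    hits = []
    for node_id in DATA_INDEX.get(term, []):
        tag, parent_id, _ = NODES[node_id]
        if tag != "タイトル":          # the string must occur in a title element
            continue
        # Climb upstream until a 特許 node is reached (or the table runs out).
        while parent_id in NODES and NODES[parent_id][0] != "特許":
            parent_id = NODES[parent_id][1]
        if parent_id not in NODES:
            continue
        # Second filter: a 年 child of this 特許 node must be >= min_year.
        years = [int(text) for t, p, text in NODES.values()
                 if t == "年" and p == parent_id]
        if any(year >= min_year for year in years):
            hits.append(parent_id)
    return hits

print(find_patents("文書", 1998))  # [42]
```

Node 53 is dropped by the year filter and node 63 because the string occurs in an abstract rather than a title, so only the "特許" tree topped by node 42 remains.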
【０１４９】
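The search process is not limited to the method described above; various efficient search methods using the index information are possible.
【０１５０】
The search request processing unit 3 integrates the results obtained in step S102 and creates an XML document as the search result (step S103).
【０１５１】
For example, the search result is the following XML document:
<out>
<result>
uix://root/特許DB/特許[0]
</result>
<result>
uix://root/特許DB/特許[2]
</result>
</out>
【０１５２】
The search request processing unit 3 returns this XML document, together with a style sheet, to the requesting client terminal via the result processing unit 12 (step S104).
【０１５３】
The client terminal converts the XML document shown in FIG. 11 into HTML data using the style sheet and displays it in area W12, as shown for example in FIG. 26. Here, because the number of structured documents obtained as search results is large, for example, the structured document paths of the retrieved structured documents are displayed as the search results. Suppose the user selects a desired one of the structured document paths displayed in area W12 of FIG. 26, for example the first one. In that case the client terminal may send a get command to the structured document management system as a document acquisition request for retrieving the structured document identified by the selected path.
【０１５４】
The document acquisition operation of the structured document management system of FIG. 1 when the request receiving unit 11 accepts a get command is described with reference to the flowchart of FIG. 28.
【０１５５】
For example, the get command "getXML("uix://root/特許DB/特許[0]")" is sent to the structured document management system.
【０１５６】
Here, the case where the get command "getXML("uix://root/特許DB/特許[0]")" is accepted while the structured document database is in the state of FIG. 8 is used as an example.
【０１５７】
On receiving the get command, the request receiving unit 11 passes the structured document path "uix://root/特許DB/特許[0]", the parameter of the command, to the document acquisition unit 22 (step S31).
【０１５８】
The document acquisition unit 22 passes the structured document path to the document object tree acquisition unit 43. The document object tree acquisition unit 43 identifies the physical area in the document storage unit 5 from the structured document path and retrieves the node (document object Ox5) represented by the path that is located in that area (step S32). If the structured document path is specified correctly, the object ID of document object Ox5 can be obtained (step S33), in which case processing proceeds to step S35.
【０１５９】
For the get command above, for example, node "#42" is document object Ox5, so "#42" is obtained as its object ID and the document object tree Ot5 below node "#42" (nodes "#42" to "#49") is retrieved (step S35).
【０１６０】
If no document object Ox5 corresponding to the specified structured document path is found in step S32, an error results (step S33), and a "document acquisition failed" message is returned to the client terminal via the document acquisition unit 22 and the result processing unit 12 (step S34).
【０１６１】
The document object tree Ot5 retrieved in step S35 is converted into an XML document by the document string acquisition unit 44. For the get command above, for example, the resulting XML document is the XML document of "特許" information shown in FIG. 3.
【０１６２】
The document acquisition unit 22 returns the XML document such as that of FIG. 3 to the client terminal via the result processing unit 12, together with a predetermined style sheet, for example one written in XSL (eXtensible Stylesheet Language) (step S37).
【０１６３】
The client terminal converts the XML document of FIG. 3 into HTML data using the style sheet and displays it in area W13, as shown for example in FIG. 29.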
【０１６４】
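With XSL, an XML document can be converted into various forms. It can be converted into an XML document with a different document structure, and an HTML page can be generated from an XML document.
【０１６５】
Schemas can be searched in the same way.
【０１６６】
For example, for the search request "among the documents below "uix://root" that have "schema" as their top tag, find the schemas that have the tag names "特許" and "要約"", the user selects "root" in area W1 with the mouse or the like to enter the structured document path as the search scope, as shown in FIG. 30. The user then enters "#schema" as the top node. As search conditions, "an attribute name of a component contains the string "特許"" and "an attribute name of a component contains the string "要約"" are entered in the data entry fields provided in advance.
【０１６７】
When the user then selects the "検索" (search) button B21, a query describing this search request, for example as shown in FIG. 31, is sent to the structured document management system together with an append command for storing the query in the structured document database.
【０１６８】
For this query, the search first looks for what matches the condition "has "#schema" as its top tag", for example. The element name occurrence index as shown in FIG. 9 is used to obtain the object IDs of the nodes (document objects) linked to the element "#schema". For each of them, the arcs of the document object tree are followed downstream, and when elements whose attribute names are "特許" and "要約" are reached, the document object tree Ot21 headed by that "#schema" tag is extracted. This document object tree Ot21 is a document that matches the query. In accordance with what the query of FIG. 31 requests, the structured document path to the top node of each document object tree Ot21 is then obtained.
【０１６９】
If there are multiple document object trees Ot21, the search request processing unit 3 gathers the structured document paths to their top nodes, creates an XML document as the search result, and returns it together with a style sheet to the requesting client terminal via the result processing unit 12.
【０１７０】
The client terminal converts the XML document received as the search result into HTML data using the style sheet and displays it in area W12, as shown for example in FIG. 26.
【０１７１】
When one schema among the search results is selected and displayed on the client terminal, a screen for storing and deleting documents is displayed, for example as shown in FIG. 32, with data entry fields for "特許" information set up for each element in its area W3.
【０１７２】
By entering data in these data entry fields, the user can easily create a document for storage that has the document structure defined by the schema.
【０１７３】
For example, when "特許DB" is selected in area W1 with the mouse or the like as the storage destination of the "特許" information entered in area W3 of FIG. 32, "uix://root/特許DB" is displayed in area W2 as the structured document path. When the user then selects the "登録" (register) button B1, the append command "appendXML("uix://root/特許DB", "<特許>…</特許>")" is sent to the structured document management system.
【０１７４】
As described above, the structured document management system of FIG. 1 treats the enormous number of XML documents with differing document structures registered in the structured document database (content documents, schema documents, query documents, and so on) as a single huge tree-shaped XML document headed by the "root" tag, as shown in FIGS. 18 and 19. Documents matching the search conditions can therefore be found easily among an enormous number of documents with differing document structures and various schemas.
【０１７５】
Moreover, because the queries used for searching are also structured documents, storing them in the structured document database as a log makes it easy to build applications that reuse past queries.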
【０１７６】
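(Document structure search)
Building on the foregoing description of the basic configuration and processing operations of the structured document management system, the configuration and processing operations for document structure (schema) search according to an embodiment of the present invention are described below.
【０１７７】
As shown in FIG. 1, to perform schema search the structured document management system 100 includes a schema search processing unit 303, a template processing unit 306 within the data access unit 4, and also a template storage unit 404 and a semantic network storage unit 405.
【０１７８】
To manage the hierarchical logical structure of the structured document database (hereinafter also called the XML database), the template processing unit 306 creates a template tree that represents that logical structure. The template tree is updated each time a new XML document is stored.
【０１７９】
The template tree created here is stored in the template storage unit 404.
【０１８０】
The client terminal 102 sends a schema search request (query) to the structured document management system 100. The request is received by the request receiving unit 11 of the structured document management system 100 and passed from there to the schema search processing unit 303.
【０１８１】
Based on the schema search request, the schema search processing unit 303 performs the schema search while referring to the semantic network stored in the semantic network storage unit 405 and the templates stored in the template storage unit 404.
【０１８２】
(1) Updating the template tree
First, the operation by which the template processing unit 306 updates the template tree when a document is stored is described.
【０１８３】
As described above, when the document storing unit 21 executes an append command and registers an XML document in the document storage unit 5, it notifies the template processing unit 306 of the structured document path of the document's storage destination in the database and of information on the document structure of the XML document being registered.
【０１８４】
FIG. 33 shows a configuration example of the template processing unit 306.
【０１８５】
In the template processing unit 306, a structure extraction unit 306a first extracts structure information of the XML document, such as the element names and attribute names of its components, from the information on the document structure of the XML document being registered. Based on this and the structured document path, a template tree processing unit 306b updates the template tree stored in the template storage unit 404, which represents the logical structure within the XML database.
【０１８６】
The template tree is obtained by extracting the document structures of the XML documents stored in the XML database and arranging them as a tree according to the hierarchical structure of the XML database; each node (each component) is stored with a template ID assigned to it. As described above, an XML document being registered is stored by the document storing unit 21 with an object ID attached to each of its components, and a template ID is attached as well. The element names, attribute names, and so on of all the components making up the document structure of the XML document being registered are assigned template IDs and stored on the template tree.
【０１８７】
For example, suppose that before the update a template tree such as that shown in FIG. 34 is stored in the template storage unit 404. The state of the XML database is then the same as in FIG. 34. Suppose that an XML document 601 such as that of FIG. 35(a) is registered in the XML database below "/文献DB/特許DB", and an XML document 602 such as that of FIG. 35(b) is registered below "/人事DB/従業員DB". The notation "/文献DB/特許DB" is the same as the structured document path used to express the position of a component in the XML database, and is used here to express the position of a component on the template tree as well.
【０１８８】
The template tree stores a logical structure that includes at least the document structure of each XML document stored in the XML database and its storage position in the hierarchical structure of the XML database.
【０１８９】
Here, the template IDs given to the element names and attribute names of the components are assumed to be identical to the reference signs TID1 to TID13 attached to the element names (tag names) and attribute names of the components in FIGS. 34 and 35.
【０１９０】
In this case all the elements are new as nodes on the template tree, so the template tree is updated as shown in FIG. 36.
【０１９１】
When multiple XML documents whose document structures are entirely or partly identical are stored at the storage position represented by the same structured document path, the components common to two or more XML documents are consolidated on the template tree (logical structure) and represented as a single component. In this case the correspondence between the template ID of a component on the template tree and the object IDs of the corresponding components in the XML database must be managed. To that end, the template ID of each component and the object IDs of that component in the XML documents may be recorded on the template tree, or a separate table recording the correspondence between template IDs and object IDs may be provided.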
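The consolidation of shared components can be sketched in Python as follows. This is an illustrative model only: the `TemplateNode` class, the `merge` function, and the object ID sequence starting at 42 are assumptions, and the first of the two bookkeeping options from the text is used (object IDs recorded directly on the template node).

```python
import xml.etree.ElementTree as ET
from itertools import count

class TemplateNode:
    _ids = count(1)

    def __init__(self, name):
        self.name = name
        self.template_id = f"TID{next(TemplateNode._ids)}"
        self.object_ids = []      # object IDs consolidated into this node
        self.children = {}        # element name -> TemplateNode

def merge(template, element, object_ids):
    """Fold one stored element (and its subtree) into the template node."""
    template.object_ids.append(next(object_ids))
    for child in element:
        node = template.children.get(child.tag)
        if node is None:          # a structure not seen before at this position
            node = template.children[child.tag] = TemplateNode(child.tag)
        merge(node, child, object_ids)

object_ids = count(42)            # hypothetical object ID sequence
patent = TemplateNode("特許")
for xml in ("<特許><タイトル>A</タイトル><要約>a</要約></特許>",
            "<特許><タイトル>B</タイトル><著者>b</著者></特許>"):
    merge(patent, ET.fromstring(xml), object_ids)

print(sorted(patent.children))                     # shared タイトル appears once
print(patent.children["タイトル"].object_ids)
```

Two documents that both contain "タイトル" yield one template node for it, carrying the object IDs of both occurrences, while "要約" and "著者" each get their own node.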
【０１９２】
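In practice, the template tree is stored in the template storage unit 404 as an XML document such as that shown in FIG. 37. This is called the template tree document here.
【０１９３】
As shown in FIG. 37, the template tree document is an XML document organized hierarchically from components whose element name is, for example, "xsp:Node". As the template tree document also shows, when an XML document is registered in the XML database, the type attribute of "xsp:Node" is given the value "document" for the root node of the XML document being registered (the element name of its top component), "element" for the element name of every other element, and "attribute" for components that represent attributes.
【０１９４】
When a folder is created, "folder" is given as the type attribute, and "rootElement" is given for the root of the XML database. In addition, among the components in the XML documents being registered, a component that is registered every time is likely to be mandatory in the document structure, so this is recorded in an attribute called "required" (when the component is mandatory, the attribute value is "yes"). A simple schema extraction is also performed: when the same element appears multiple times in one document, its number of occurrences is treated as unlimited (maxOccurs="*"). A type inference step may also be performed to record a candidate value ("dataType"). If a value outside the candidates arrives for the element, it is let through.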
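The simple schema extraction described here could be approximated as below. The function name `extract_simple_schema` and the input format are assumptions; the values `"yes"` and `"*"` come from the text, whereas `"no"` and `"1"` are placeholders chosen for this sketch because the specification does not state what is recorded otherwise.

```python
from collections import Counter

def extract_simple_schema(documents):
    """documents: list of lists, each the child element names of one stored document."""
    total = len(documents)
    seen_in = Counter()        # element name -> number of documents containing it
    repeated = set()           # element names seen more than once in one document
    for children in documents:
        per_doc = Counter(children)
        seen_in.update(per_doc.keys())
        repeated.update(name for name, n in per_doc.items() if n > 1)
    return {
        name: {
            # Present in every registered document: probably mandatory.
            "required": "yes" if seen_in[name] == total else "no",
            # Repeated within a document: no limit on occurrences.
            "maxOccurs": "*" if name in repeated else "1",
        }
        for name in seen_in
    }

schema = extract_simple_schema([
    ["タイトル", "著者", "著者", "要約"],
    ["タイトル", "著者"],
])
print(schema)
```

In this example "著者" is marked both required and unbounded, while "要約" is optional because it is missing from the second document.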
【０１９５】
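The number of actual components consolidated into one component on the template tree (the number of object IDs) may also be recorded as a count attribute. FIG. 37 shows only the count and type attributes.
【０１９６】
When the document storing unit 21 executes an append command and registers an XML document, an index corresponding to that XML document is created, as described above.
【０１９７】
(2) Outline of schema search processing
The client terminal 102 sends a schema search request to the structured document management system 100. The request is received by the request receiving unit 11 of the structured document management system 100, and the schema search request (schema query) is passed from there to the schema search processing unit 303.
【０１９８】
FIG. 38 shows a configuration example of the schema search processing unit 303.
【０１９９】
The schema search processing unit 303 comprises a schema query analysis unit 321, a schema query condition processing unit 322, and a schema query output processing unit 333.
【０２００】
The schema query analysis unit 321 analyzes the search request that describes the search conditions and so on for finding schemas, that is, the schema query. Query description languages include, for example, XQuery. FIG. 39 shows a specific example of a schema query. Because schema search has to create pseudo data from which the schema is inferred and then search it, a schema query cannot be expressed in ordinary XQuery or the like.
【０２０１】
A schema query begins with the component that has the "xsp:query" tag and mainly contains a part described by the component with the "xsp:return" tag (hereinafter simply called the return clause) and a part described by the component with the "xsp:where" tag (hereinafter simply called the where clause).
【０２０２】
The presence of the leading "xsp:query" tag allows the request receiving unit 11 to determine that the query is a schema query.
【０２０３】
The component with the "xsp:return" tag describes the character strings (keywords) corresponding to the element name, attribute name, or the like of the top component of the document structure to be extracted. All the elements in the "xsp:return" clause are "xsp:elem" elements. Multiple candidates may be described here; for example, element names and attribute names assumed to be closely related are described in name attributes. In the schema query of FIG. 39, "特許" and "文献" (literature) are described.
【０２０４】
Here, fuzzy variation other than exact matches is also taken into account, so that a schema that is not described as "特許" but is defined as, say, "Patent" is retrieved as well.
【０２０５】
To absorb this variation, the semantic network storage unit 405 stores a semantic network such as that shown in FIG. 41. As FIG. 41 shows, a semantic network represents the relationships between vocabulary terms as a graph, with the weight (similarity) between terms attached to each arc. For example, the similarity between "構造化文書" (structured document) and "XML" is "0.8". The similarity values range from "0" to "1".
【０２０６】
FIG. 42 shows the algorithm for calculating similarity.
【０２０７】
Here, with reference to the flowchart of FIG. 42, the case of finding terms similar to "XML" while calculating their similarity is used as an example.
【０２０８】
Let the similarity threshold above which a term is judged similar to the string "XML" be "0.6", for example. First, the position of "XML" on the semantic network is looked up (step S201). The weight (similarity) of that position is set to "1.0" (step S202). The current keyword is then added to the search candidate list and to the confirmed candidate list (step S203).
【０２０９】
From the semantic network of FIG. 41, the terms "構造化文書" and "マークアップ言語" (markup language), which are similar to "XML", are retrieved, each with a similarity of "0.8". The weight "1.0" of the source term "XML" is multiplied by the weight of each retrieved term, giving "0.8" for each. The source term "XML" is removed from the search candidate list and these two terms are added (step S205).
【０２１０】
Since the similarity of these two terms is "0.8", they are also added to the confirmed candidate list (steps S206 to S207).
【０２１１】
Next, taking "構造化文書" on the search candidate list as the source and expanding the semantic network of FIG. 41 by one level, the similarity of "構造文書" is obtained as "0.4" by multiplying its similarity to the source, "0.5", by the source's weight, "0.8". Because the threshold is set to "0.6", no further expansion is performed from there. Repeating this process leaves "XML" (1.0), "マークアップ言語" (0.8), "SGML" (0.64), "HTML" (0.64), and "半構造化文書" (0.64) on the confirmed candidate list.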
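The expansion procedure can be written down as a short breadth-first traversal. The Python sketch below is one possible reading of the flowchart description, not a transcription of FIG. 42. The `NETWORK` dictionary is hypothetical: only the weights 0.8 (XML to 構造化文書 and マークアップ言語) and 0.5 (構造化文書 to 構造文書) are stated in the text, and the remaining edges are assumed so that the 0.64 values come out.

```python
# Hypothetical fragment of the FIG. 41 semantic network: term -> {neighbour: weight}.
NETWORK = {
    "XML": {"構造化文書": 0.8, "マークアップ言語": 0.8},
    "構造化文書": {"XML": 0.8, "構造文書": 0.5, "半構造化文書": 0.8},
    "マークアップ言語": {"XML": 0.8, "SGML": 0.8, "HTML": 0.8},
}

def similar_terms(keyword, threshold=0.6):
    confirmed = {keyword: 1.0}          # confirmed candidate list
    frontier = [keyword]                # search candidate list
    while frontier:
        source = frontier.pop(0)
        for term, weight in NETWORK.get(source, {}).items():
            score = confirmed[source] * weight
            # Stop expanding once the score is at or below the threshold, and
            # keep a term only if this path beats a score found earlier.
            if score > threshold and score > confirmed.get(term, 0.0):
                confirmed[term] = score
                frontier.append(term)
    return confirmed

for term, score in similar_terms("XML").items():
    print(term, round(score, 2))
```

With these assumed weights, "構造文書" (0.4) is pruned, while terms two steps away from "XML" end up at 0.8 × 0.8 = 0.64. In this sketch "構造化文書" (0.8) also stays on the confirmed list, since it was added there before being expanded.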
【０２１２】
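In the schema query of FIG. 39, the similarity threshold is described as an attribute of "xsp:elem", namely as the value of an attribute named "sim". Setting the threshold with this attribute value suppresses expansion of the semantic network in search of terms whose similarity is at or below the threshold. The threshold may be set by the user, for example when entering the search conditions, but a default value written into the schema query in advance may be used instead. Even when it is not described in the schema query, a value predetermined in the schema search processing unit 303 may be used.
【０２１３】
When searching for a schema, the user may not even know which part to extract. In that case, writing "<xsp:return type="all"/>" in the schema query lets the schema search processing unit 303 automatically identify the root nodes of documents (as described above, on the template tree the top component of each XML document carries the attribute value "document", so root nodes can be identified by checking for it), take from the template tree the document structure headed by each such root node (creating a pseudo schema), and return it.
【０２１４】
Returning to the schema query of FIG. 39, the component with the "xsp:where" tag describes, as search conditions, character strings (keywords) contained in the element names or attribute names of components. In this clause too, the search conditions are described with their ambiguity left in place.
【０２１５】
For example, when it is not known whether a string is an element name or an attribute name, this is expressed as "<xsp:elem type="element"/>"; when it is not known whether it is an element value or an attribute value, as "<xsp:elem type="word"/>". When it is not known at all whether it is an element name, an attribute name, an element value, or an attribute value, "<xsp:elem type="all"/>" is written.
【０２１６】
That is, when a user specifies a string and it is to serve as a search condition for finding schemas that have it as an element name (or attribute name), "<xsp:elem type="element"/>" is written; when it is to serve as a condition for finding schemas that have it as an element value (or attribute value), "<xsp:elem type="word"/>" is written; and when it is to serve as a condition for finding schemas that have it as either an element name (or attribute name) or an element value (or attribute value), "<xsp:elem type="all"/>" is written.
【０２１７】
The name denoting the schema is described in the name attribute. A search that explicitly specifies a type is also possible; in the schema query of FIG. 40, for example, the dataType attribute specifies "int", the integer type. Likewise, how far to expand is described with sim.
【０２１８】
Next, the general flow of schema search in the schema search processing unit 303 is described with reference to the flowchart of FIG. 43.
【０２１９】
For a schema query passed to it, the schema search processing unit 303 first processes the where clause (step S301) and then the return clause (step S302), merges the results of the two (step S303), performs schema inference on the merged result, that is, extracts schemas (step S304), sorts the extracted schemas on the basis of a scoring process (step S305), and finally formats the output (step S306) and returns the result.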
【０２２０】
次に、具体例を挙げて、スキーマ検索について、より詳細に説明する。
【０２２１】
（３）具体例
クライアント端末１０２には、図４４に示すようなスキーマ検索のための検索条件の入力画面が表示される。この入力画面上で、ユーザは、検索したいスキーマとして、例えば、「ＸＭＬ関連の特許で著者が記述されているようなスキーマ」という内容の自然文を入力したとする。
【０２２２】
この入力された内容は、クライアント端末１０２では、例えば、自然文解析により、図４５に示すようなスキーマクエリに変換されて、構造化文書管理システム１００へ送信される。
【０２２３】
なお、ここでは、ユーザは、スキーマ検索のために自然文を入力するとしたが、この場合に限らず、例えば、（条件１）どのような文字列を要素名（あるいは属性値）、要素値（あるいは属性値）として含むのか、（条件２）どのような種類の文書の文書構造を抽出したいのか、といった内容が含まれていれば、自然文で入力する必要はない。また、上記条件１、条件２を入力するための別個の入力領域が設けられていてもよい。また、条件１のみを入力する場合であってもよい。条件１を入力する場合、少なくとも１つの文字列が入力されていればよく、複数の文字列をカンマなどで区切りながら入力してもよい。
【０２２４】
また、条件１として指定される文字列は、構成要素の要素名である場合と、属性名である場合と、要素値である場合と、属性値である場合とがあるが、これらを別個の入力領域から入力するようにしてもよいし、１つの入力領域のみを設けて、そこに入力された文字列はこれらのうちのいずれか１つであると予め定めておいて、スキーマクエリを作成してもよい。また、条件２として入力される文字列も構成要素の要素名である場合と、属性名である場合とがあるが、これらを別個の入力領域から入力するようにしてもよいし、１つの入力領域のみを設けて、そこに入力された文字列はこれらのうちのいずれか１つであると予め定めておいて、スキーマクエリを作成してもよい。
【０２２５】
また、条件１のみであってもよいし、条件２のみであってもよい。
【０２２６】
クライアント端末１０２では、上記条件１、条件２に対応する文字列を代入すれば完成するようなスキーマクエリの雛形を複数種類（例えば、条件１と条件２が指定された場合のスキーマクエリ、条件２のみが指定された場合のスキーマクエリ、条件２に複数の文字列が指定された場合のスキーマクエリなど）記憶しておき、入力（指定）された条件に応じてこれらのうちの１つを選択し、空欄部分に条件１や条件２に対応する文字列を代入することで、スキーマクエリを作成するようにしてもよい。
【０２２７】
また、クライアント端末１０２では、ユーザにより入力された、上記条件１や条件２に対応する文字列や自然文をそのまま構造化文書管理システムへ検索要求として送信するようにしてもよい。そして、この場合には、このような検索要求を受信した構造化文書管理システムの要求受付部１１において、上記スキーマクエリを作成するようにしてもよい。要求受付部１１にてスキーマクエリを作成する場合も、前述同様である。
【０２２８】
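For the input shown in FIG. 44, the character strings corresponding to condition 1 are “XML” and “著者” (author); the former is, for example, a string contained in an element value, and the latter a string contained in an element name or attribute name. The string corresponding to condition 2 is “特許” (patent), which is judged to be, for example, a string contained in an element name. As a result, a schema query such as the one shown in FIG. 45 is created.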
【０２２９】
当該スキーマクエリは、構造化文書管理システム１００の要求受付部１１を経由して、スキーマ検索処理部３０３に渡される。あるいは、要求受付部１１にて作成されて、スキーマ検索処理部３０３に渡される。
【０２３０】
まず、スキーマクエリ解析部３２１において、スキーマクエリのｗｈｅｒｅ節の解析を行う（図４３のステップＳ３０１）。この処理動作について、図５４に示すフローチャートを参照して説明する。
【０２３１】
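In this where clause, each character string corresponding to condition 1 is described in its own constituent element with the element name (tag name) “xsp:elem”. A string that is to be matched against element names (or against either element names or attribute names), here “著者” (author), is marked with type="element". A string that is to be matched against element values (or against either element values or attribute values), here “XML”, is marked with type="word". A string that may match either an element name (or attribute name) or an element value (or attribute value) is marked with type="all".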
【０２３２】
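First, the constituent elements in the where clause are taken out one at a time (step S401), and each is checked to see which of the above types it specifies (steps S402 to S404).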
【０２３３】
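In the schema query shown in FIG. 45, the first element specifies type="element" for the string “著者” (author) (step S402), so processing proceeds to step S405. There, constituent elements whose element name matches or contains “著者” or a word of similar meaning are searched for (structure estimation processing).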
【０２３４】
また、２つ目の要素では、「ＸＭＬ」という文字列に対して、「ｔｙｐｅ＝“ｗｏｒｄ”」と記述されているので（ステップＳ４０３）、ステップＳ４０６へ進み、例えば、ここでは、「ＸＭＬ」という文字列とこれに類似する語彙の文字列を要素値として持つ（要素値に一致あるいは要素値に含む）構成要素を検索する処理を行う（語彙推定処理）。
【０２３５】
また、ｗｈｅｒｅ節内の構成要素として、「ｔｙｐｅ＝“ａｌｌ”」と記述された文字列が存在する場合には（ステップＳ４０４）、ステップＳ４０５とステップＳ４０６と同様にして、当該文字列とこれに類似する文字列を要素名や要素値として持つ構成要素を検索する処理を行う（ステップＳ４０７，ステップＳ４０８）。
【０２３６】
以上ステップＳ４０１〜ステップＳ４０８の処理をｗｈｅｒｅ節内の全ての構成要素について繰り返す。
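The loop over the where clause (steps S401 to S408) can be sketched as a simple dispatch on the type attribute. This is a minimal illustration, not the patented implementation: `structure_estimation`, `vocabulary_estimation`, and the `(type, keyword)` pair representation are hypothetical names introduced only for this example.

```python
from typing import List, Tuple

# Hypothetical stand-ins for the two estimation routines
# (steps S405/S407 and steps S406/S408). They only record the call.
def structure_estimation(keyword: str) -> str:
    return f"structure:{keyword}"

def vocabulary_estimation(keyword: str) -> str:
    return f"vocabulary:{keyword}"

def process_where_clause(elems: List[Tuple[str, str]]) -> List[str]:
    """elems holds (type attribute, keyword) pairs taken from the where clause."""
    calls: List[str] = []
    for elem_type, keyword in elems:      # step S401: one element at a time
        if elem_type == "element":        # step S402 -> step S405
            calls.append(structure_estimation(keyword))
        elif elem_type == "word":         # step S403 -> step S406
            calls.append(vocabulary_estimation(keyword))
        elif elem_type == "all":          # step S404 -> steps S407 and S408
            calls.append(structure_estimation(keyword))
            calls.append(vocabulary_estimation(keyword))
        else:
            raise ValueError(f"unknown type: {elem_type}")
    return calls
```

For the query of FIG. 45, `process_where_clause([("element", "著者"), ("word", "XML")])` would run structure estimation for “著者” and vocabulary estimation for “XML”.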
【０２３７】
ここで、処理対象の文字列を「著者」とした場合を例にとり、ステップＳ４０５、ステップＳ４０７の構造推定処理について、図５５に示すフローチャートを参照して説明する。
【０２３８】
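First, synonyms of “著者” (author) are extracted using the semantic network (step S411). Here no similarity threshold is specified (the sim attribute is not described), so the network is expanded until the similarity falls to or below a predetermined threshold (set to 0.6 here). With the semantic network shown in FIG. 41, the vocabulary obtained is “著者” (similarity 1.0), “author” (similarity 0.8 × 0.8 = 0.64), and “名前” (name; similarity 0.7).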
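The expansion in step S411 can be sketched as a walk over a weighted graph in which similarities multiply along a path and expansion stops once the product is no longer above the threshold. The network below is an assumption made for illustration: FIG. 41 is not reproduced here, so the intermediate node `concept:writer` (standing for the two 0.8 hops) and the extra word “氏名” are invented, as are the names `NETWORK` and `expand`.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical fragment of a semantic network: word -> [(neighbor, similarity)].
# Nodes prefixed "concept:" are intermediate nodes, not vocabulary.
NETWORK: Dict[str, List[Tuple[str, float]]] = {
    "著者": [("concept:writer", 0.8), ("名前", 0.7)],
    "concept:writer": [("author", 0.8)],
    "名前": [("氏名", 0.5)],
}

def expand(keyword: str, threshold: float = 0.6) -> Dict[str, float]:
    """Return words whose path similarity to keyword stays above threshold."""
    best: Dict[str, float] = {keyword: 1.0}
    heap: List[Tuple[float, str]] = [(-1.0, keyword)]
    while heap:
        neg_sim, word = heapq.heappop(heap)
        sim = -neg_sim
        if sim < best.get(word, 0.0):
            continue  # a better path to this word was already processed
        for neighbor, weight in NETWORK.get(word, []):
            candidate = sim * weight  # similarities multiply along a path
            if candidate > threshold and candidate > best.get(neighbor, 0.0):
                best[neighbor] = candidate
                heapq.heappush(heap, (-candidate, neighbor))
    # Drop intermediate concept nodes; keep only vocabulary.
    return {w: s for w, s in best.items() if not w.startswith("concept:")}
```

With the default threshold of 0.6, `expand("著者")` keeps “著者”, “名前” and “author”, and drops “氏名” because 0.7 × 0.5 falls below the threshold.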
【０２３９】
次に、前述した文書の検索の場合と同様にして、インデックス記憶部６に記憶されている要素名称生起インデックスを検索して、処理対象の文字列と上記抽出された類似語のそれぞれを要素名としてもつ構成要素のオブジェクトＩＤを取得し、さらに、例えば、前述したようなテーブルあるいはテンプレートツリー上の記述を基に、この取得した各オブジェクトＩＤのそれぞれに対応するテンプレートＩＤを取得する（ステップＳ４１２）。
【０２４０】
そして、各テンプレートＩＤに対して評価値（第１の評価値）を求める（ステップＳ４１３）。この第１の評価値は、取得したテンプレートＩＤに対応する構成要素が、検索条件として指定された文字列（語彙）にどれだけ類似するかを評価するものであれば、何でもよい。
【０２４１】
例えば、現在のＸＭＬデータベースが図４６に示すような状態であるとき、ステップＳ４１２において取得されるテンプレートＩＤは、「ＴＩＤ７」、「ＴＩＤ１５」、「ＴＩＤ５７」の３通りであり、これらの評価値は、例えば、その要素名に含まれていた語彙の類似度をそのまま用いて、例えば、「ＴＩＤ７」の評価値は「１．０」、「ＴＩＤ１５」の評価値は「０．７」、「ＴＩＤ５７」の評価値は「０．６４」とする。各テンプレートＩＤとその評価値をリストＬ１に記録しておく。このとき、この取得したテンプレートＩＤのそれぞれについて、当該テンプレートＩＤに対応するオブジェクトＩＤと、その数もリストＬ１に記録する。
【０２４２】
次に、処理対象の文字列を「ＸＭＬ」とした場合を例にとり、ステップＳ４０６，ステップＳ４０８の語彙推定処理について、図５６に示すフローチャートを参照して説明する。
【０２４３】
First, similar words of “XML” are extracted using the semantic network (step S421). In this case, “XML” (similarity 1.0), “SGML” (similarity 0.72), and “構造化文書” (structured document; similarity 0.81) are obtained.
【０２４４】
次に、前述した文書の検索の場合と同様にして、インデックス記憶部６に記憶されているデータ生起インデックスを検索して、処理対象の文字列と上記抽出された類似語の各語彙のそれぞれを要素値に持つ構成要素のオブジェクトＩＤを取得し、さらに、例えば、前述したようなテーブルあるいはテンプレートツリー上の記述を基に、この取得した各オブジェクトＩＤのそれぞれに対応するテンプレートＩＤを取得する（ステップＳ４２２）。
【０２４５】
そして、各テンプレートＩＤに対して評価値（第２の評価値）を求める（ステップＳ４２３）。この第２の評価値は、取得したテンプレートＩＤに対応する構成要素が、検索条件として指定された文字列（語彙）にどれだけ類似するか、また、検索条件として指定された文字列（語彙）とその類似語とのうちのいずれかをどれだけ多く含む構成要素であるかなどを評価するものであれば、何でもよい。
【０２４６】
取得されたテンプレートＩＤのそれぞれに対応する構成要素が、上記複数の語彙のうちの一部あるいは全ての複数の語彙を要素値としてもつ場合もある。また、この取得したテンプレートＩＤに対応する構成要素のオブジェクトＩＤがデータ生起インデックスから複数個取得されている場合もある。
【０２４７】
ここでは、例えば、取得された各テンプレートＩＤについて、当該テンプレートＩＤに対応する構成要素が要素値に上記複数の語彙のうちのどの語彙を含むのかを基に上記評価値を与える。なお、各テンプレートＩＤに対応する（データ生起インデックスから取得された）オブジェクトＩＤの数と語彙の類似度を掛け合わせて求めてもよい。
【０２４８】
例えば、現在のＸＭＬデータベースが図４６に示すような状態であるとき、ステップＳ４２２では、「タイトル」という要素名を持つ構成要素（テンプレートＩＤ「ＴＩＤ６」）と、その他に「ＴＩＤ１７」、「ＴＩＤ５８」というテンプレートＩＤが得られたものとする。
【０２４９】
テンプレートＩＤ「ＴＩＤ６」の「タイトル」という構成要素は、「ＸＭＬ」という語彙（類似度は「１．０」）と「ＳＧＭＬ」という語彙（類似度は「０．７２」）とから検索されているので、例えば、これら語彙の類似度を加算して、１＋０．７２＝１．７２を評価値として与える。例えば、「ＸＭＬ」という語彙で検索されたオブジェクトＩＤの数が１００件で、「ＳＧＭＬ」という語彙で検索されたオブジェクトＩＤの数が５０件であれば、１×１００＋０．７２×５０＝１３６を評価値としてもよい。以下の説明では、評価値の算出は、説明の簡単のため、前者の例をとることにする。
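The two ways of computing the second evaluation value for TID6 can be checked with a few lines of arithmetic. The function and variable names are hypothetical; the similarities (1.0 and 0.72) and the object counts (100 and 50) are the ones given in the text.

```python
from typing import Dict

def score_by_similarity(hits: Dict[str, float]) -> float:
    """Sum of the similarities of the words that matched the element."""
    return sum(hits.values())

def score_by_count(hits: Dict[str, float], counts: Dict[str, int]) -> float:
    """Similarity of each word weighted by the number of matching object IDs."""
    return sum(sim * counts[word] for word, sim in hits.items())

# Words that matched the "タイトル" (title) element, template ID TID6.
tid6_hits = {"XML": 1.0, "SGML": 0.72}
tid6_counts = {"XML": 100, "SGML": 50}

simple_score = score_by_similarity(tid6_hits)            # 1 + 0.72
weighted_score = score_by_count(tid6_hits, tid6_counts)  # 1*100 + 0.72*50
```

The first variant gives 1.72 and the second 136, up to floating-point rounding. The count-weighted variant favors elements that occur often in the database, while the plain sum depends only on which words matched.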
【０２５０】
テンプレートＩＤ「ＴＩＤ１７」の構成要素は、「ＸＭＬ」という語彙のみから検索されているので、この語彙の類似度「１．０」を評価値として与える。
【０２５１】
テンプレートＩＤ「ＴＩＤ５８」の構成要素は、「ＸＭＬ」という語彙のみから検索されているので、この語彙の類似度「１．０」を評価値として与える。
【０２５２】
各テンプレートＩＤとその評価値（第２の評価値）をリストＬ２に記録しておく。このとき、この取得したテンプレートＩＤのそれぞれについて、当該テンプレートＩＤに対応するオブジェクトＩＤと、その数もリストＬ２に記録する。
【０２５３】
以上のｗｈｅｒｅ節の処理から、以下のようなテンプレートＩＤが得られたことになる。なお、括弧内には、「Ｄ」から始まる数字で第２の評価値を示し、「Ｓ」から始まる数字で第１の評価値を示している。
【０２５４】
{TID6 (D1.72), TID7 (S1.0), TID15 (S0.7), TID17 (D1.0), TID57 (S0.64), TID58 (D1.0)}
次に、スキーマクエリ解析部３２１において、スキーマクエリのｒｅｔｕｒｎ節の解析を行う（図４３のステップＳ３０２）。この処理動作について、図５７に示すフローチャートを参照して説明する。
【０２５５】
ｒｅｔｕｒｎ節には、抽出すべき文書構造の先頭の構成要素の要素名や属性名が、１つずつ「ｘｓｐ：ｅｌｅｍ」という要素名（タグ名）の構成要素に記述されている。図４５に示したスキーマクエリでは、要素名に含まれる文字列として、「特許」が指定されている。
【０２５６】
まず、ｒｅｔｕｒｎ節から要素（「ｘｓｐ：ｅｌｅｍ」という要素名の構成要素）を１つずつ取出し（ステップＳ４３１）、当該要素に記述されている文字列、例えば、ここでは「特許」を処理対象の文字列として、図５５に示したように、構造推定処理を行う（ステップＳ４３２）。上記ステップＳ４３１〜ステップＳ４３２をｒｅｔｕｒｎ節内に記述された全ての要素について行う（ステップＳ４３３）。
【０２５７】
構造推定処理では、前述同様にして、まず、処理対象の文字列の類似語を意味ネットワークを用いて抽出する。
【０２５８】
この場合、「特許」（類似度「１．０」）、「Ｐａｔｅｎｔ」（類似度「０．８」）が抽出される。
【０２５９】
例えば、現在のＸＭＬデータベースが図４６に示すような状態であるとき、上記２つの語彙のいずれかを要素名にもつ構成要素を要素名称生起インデックスから求めると、以下のような３つのテンプレートＩＤが得られる。
【０２６０】
{TID4 (S1.0), TID55 (S0.8), TID72 (S1.0)}
In FIG. 46, the constituent elements (nodes) corresponding to these three template IDs are circled.
【０２６１】
次に、図４３のステップＳ３０３の処理を行う。すなわち、ｗｈｅｒｅ節を処理した結果得られた構成要素と、ｒｅｔｕｒｎ節を処理した結果得られた構成要素とから、テンプレートツリーから１つのＸＭＬ文書の文書構造の先頭となる構成要素を抽出する。具体的には、ｒｅｔｕｒｎ節の処理により抽出された構成要素と、ｗｈｅｒｅ節の処理により抽出された構成要素とのテンプレートツリー上の親子関係をチェックする。
【０２６２】
ｗｈｅｒｅ節の処理により抽出される構成要素は、文書構造の下流にある構成要素であり、ｒｅｔｕｒｎ節の処理により抽出される構成要素は、文書構造の先頭の構成要素である。
【０２６３】
例えば、ｒｅｔｕｒｎ節の処理により抽出された構成要素に対応するノード以下に、ｗｈｅｒｅ節の処理により抽出された構成要素が存在する場合には、ｒｅｔｕｒｎ節の処理により抽出された当該構成要素を、抽出すべき文書構造の先頭の構成要素の候補として、出力候補リストに登録する。
【０２６４】
以下、ＸＭＬデータベースが図４６に示したような状態である場合を例にとり説明するが、この場合のテンプレートツリーの構造も、図４６と同様である。但し、図４６に示すように、「ＸＭＬ」や「ＳＧＭＬ」などの語彙は、テンプレートツリーには含まれていない。また、図４６には、構成要素の属性は、２重の四角形で囲み、示している。
【０２６５】
まず、テンプレートツリーを下流側にある構成要素から上流へ向かって探索して、上記親子関係をチェックする。
【０２６６】
例えば、テンプレートＩＤ「ＴＩＤ６」の構成要素の親要素（１段上の階層にある構成要素であって、テンプレートＩＤ「ＴＩＤ６」の構成要素を包含する構成要素）は、テンプレートＩＤ「ＴＩＤ４」の構成要素である。このテンプレートＩＤ「ＴＩＤ４」の構成要素は、ｒｅｔｕｒｎ節の処理により抽出された構成要素であるので、この時点で、テンプレートＩＤ「ＴＩＤ４」は出力候補リストに追加する。
【０２６７】
続いて、テンプレートＩＤ「ＴＩＤ７」の構成要素に対してであるが、これも上記同様に、テンプレートＩＤ「ＴＩＤ４」が親要素である。テンプレートＩＤ「ＴＩＤ４」は、既に出力候補リストに登録されているので、次に進む。
【０２６８】
テンプレートＩＤ「ＴＩＤ１５」の親要素は、テンプレートＩＤ「ＴＩＤ１４」の構成要素であるが、これは、ｒｅｔｕｒｎ節の処理により抽出された構成要素ではないので、更に上流に遡る。「ＴＩＤ１４」の１段上の階層には「ＴＩＤ１３」があり、さらに、その１段上の階層には「ＴＩＤ１２」があり、さらに、その１段上の階層には「ＴＩＤ１」がある。これらはどれもｒｅｔｕｒｎ節の処理により抽出された構成要素ではないので、テンプレートＩＤ「ＴＩＤ１５」はスキーマ検索から除外する。
【０２６９】
同様にして、テンプレートＩＤ「ＴＩＤ１７」も除外される。
【０２７０】
テンプレートＩＤ「ＴＩＤ５７」、テンプレートＩＤ「ＴＩＤ５８」は、それぞれ上流にテンプレートＩＤ「ＴＩＤ５５」を持ち、これはｒｅｔｕｒｎ節の処理により抽出された構成要素であるから、当該テンプレートＩＤ「ＴＩＤ５５」を出力候補リストに登録する。
【０２７１】
この時点での、出力候補リストには、「ＴＩＤ４」と「ＴＩＤ５５」が登録されている。
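The upstream search described above can be sketched as a walk from each where-clause hit toward the root of the template tree, stopping at the first ancestor that the return clause also selected. The parent table is an assumption reconstructed from the text: the chain TID15 → TID14 → TID13 → TID12 → TID1 is stated, while the direct parents of TID17, TID57 and TID58 are guessed from the document structures they belong to. `PARENT` and `find_heads` are hypothetical names.

```python
from typing import Dict, List, Set

# child -> parent in the template tree (partly assumed, see lead-in).
PARENT: Dict[str, str] = {
    "TID6": "TID4", "TID7": "TID4",
    "TID15": "TID14", "TID17": "TID14",
    "TID14": "TID13", "TID13": "TID12", "TID12": "TID1",
    "TID57": "TID55", "TID58": "TID55",
}

def find_heads(where_hits: List[str], return_hits: Set[str]) -> List[str]:
    """Collect return-clause elements that have a where-clause hit below them."""
    candidates: List[str] = []
    for tid in where_hits:
        node = PARENT.get(tid)
        while node is not None:
            if node in return_hits:
                if node not in candidates:  # already registered: move on
                    candidates.append(node)
                break
            node = PARENT.get(node)         # keep walking upstream
        # Falling out of the loop without a match excludes this hit.
    return candidates

where_hits = ["TID6", "TID7", "TID15", "TID17", "TID57", "TID58"]
return_hits = {"TID4", "TID55", "TID72"}
output_candidates = find_heads(where_hits, return_hits)
```

With these inputs the output candidate list holds TID4 and TID55; TID15 and TID17 are dropped because none of their ancestors was selected by the return clause.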
【０２７２】
次に、テンプレートツリーを上流から下流に探索して、上記親子関係をチェックする。
【０２７３】
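The template IDs extracted by the return clause processing were TID4, TID55 and TID72, of which TID4 and TID55 are already registered in the output candidate list. Here, for example, the remaining TID72 is therefore also added to the output candidate list. TID72 may of course be left out of the list instead, since no constituent element extracted by the where clause processing lies below it.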
【０２７４】
この時点で出力候補リストには、「ＴＩＤ４」、「ＴＩＤ５５」、「ＴＩＤ７２」が登録されている。
【０２７５】
次に、図４３のステップＳ３０４のスキーマ推定処理により、先に抽出された構成要素を先頭とする文書構造から、抽出すべき構成要素群を抽出する。
【０２７６】
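As shown in FIG. 46, the constituent elements making up the document structure headed by TID4 are {TID4 (S1.0), TID5, TID6 (D1.72), TID7 (S1.0), TID8, TID9, TID10, TID11} (see FIG. 47).
【０２７７】
First, based on the information described in the template tree as shown in FIG. 37 (for example, the value of the “required” attribute), the constituent elements that are indispensable for building the document structure headed by TID4 are selected from among these.
【０２７８】
Of the above constituent elements, those designated in the template tree as essential to the document structure headed by TID4 (that is, those whose “required” attribute has the value “yes”) are TID5, TID6, TID7 and TID8, so the document structure (schema) made up of these constituent elements is obtained as one of the search results.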
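Selecting the essential constituent elements amounts to filtering the members of the document structure on the “required” attribute. In this sketch the dictionary representation and the function name are hypothetical, and the value “no” for TID9 to TID11 is an assumption: the text only says that TID5 to TID8 carry “yes”.

```python
from typing import Dict, List

# Members of the document structure headed by TID4 (FIG. 47).
members: List[str] = ["TID4", "TID5", "TID6", "TID7",
                      "TID8", "TID9", "TID10", "TID11"]

# "required" attribute per template ID; "no" entries are assumed.
required: Dict[str, str] = {
    "TID5": "yes", "TID6": "yes", "TID7": "yes", "TID8": "yes",
    "TID9": "no", "TID10": "no", "TID11": "no",
}

def select_required(head: str, members: List[str],
                    required: Dict[str, str]) -> List[str]:
    """Keep the head element and every member marked required="yes"."""
    return [head] + [m for m in members
                     if m != head and required.get(m) == "yes"]

schema_elements = select_required("TID4", members, required)
```

The result is the head TID4 followed by TID5, TID6, TID7 and TID8, which corresponds to the schema returned as the first search result.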
【０２７９】
FIG. 48 shows the schema document (an XML document) obtained when the above document structure is described in XML Schema. An <xsd:sequence> element is added to this document.
【０２８０】
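Similarly, the constituent elements making up the document structure headed by TID55 are {TID55, TID56, TID57, TID58} (see FIG. 49). All of them are designated in the template tree as essential to the document structure headed by TID55, so the document structure (schema) made up of this group is obtained as another search result.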
【０２８１】
なお、図５０に、上記文書構造をＸＭＬＳＣＨＥＭＡで記述した場合のスキーマ文書（ＸＭＬ文書）を示す。
【０２８２】
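Similarly, the constituent elements making up the document structure headed by TID72 are {TID72, TID73, TID74, TID75} (see FIG. 51). All of them are designated in the template tree as essential to the document structure headed by TID72, so the document structure (schema) made up of this group is obtained as yet another search result. In the template tree as shown in FIG. 37, information defining a numeric type, that is, “dataType”, is described for the constituent element TID75, so a numeric-type schema is set for this “年” (year) constituent element.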
【０２８３】
このように、図４３のステップＳ３０３で、文書構造の先頭の構成要素を求めたら、ステップＳ３０４では、テンプレートツリーから当該構成要素を先頭とする文書構造を構成する構成要素群を抽出する。その際、上記のように、テンプレートツリーに記述されている各構成要素についての記述を基に、抽出すべき構成要素を選択するようにしてもよいし、選択せずに、当該文書構造を構成する全ての構成要素を抽出してもよい。
【０２８４】
また、抽出した文書構造をＸＭＬ文書、すなわち、図４８や図５０に示したようなスキーマ文書に変換する。このスキーマ文書に変換する際、テンプレートツリー上の当該構成要素に対し記述されている情報を基に、各構成要素の属性（例えば、数値型など）などを記述する。
【０２８５】
このように、テンプレートツリーには、図３７に示したようなＸＭＬ文書として、スキーマを検索し、スキーマ文書を作成する際に用いる情報が全て記述されている。
【０２８６】
次に、図４３のステップＳ３０５に進み、検索結果として得られた各スキーマに対し、評価点（スコア）を算出する。
【０２８７】
例えば、検索結果として得られた、テンプレートＩＤ「ＴＩＤ４」の構成要素を先頭とするスキーマを構成する構成要素群はその第１の評価値（Ｓ）や第２の評価値（Ｄ）とともに示すと、｛ＴＩＤ４（Ｓ１．０）、ＴＩＤ５、ＴＩＤ６（Ｄ１．７２）、ＴＩＤ７（Ｓ１．０）、ＴＩＤ８、ＴＩＤ９、ＴＩＤ１０、ＴＩＤ１１｝であった。このスキーマの評価点は、各構成要素に与えられた第１の評価値と第２の評価値の総和であり、Ｓ１．０＋Ｓ１．０＋Ｄ１．７２＝３．７２となる。
【０２８８】
ここで、総和を求める際には、第１の評価値や第２の評価値にそれぞれ重みを設定してもよい。上記例の場合、第１の評価値や第２の評価値のそれぞれについて重みを「１」として計算している。
【０２８９】
また、検索結果として得られた、テンプレートＩＤ「ＴＩＤ５５」の構成要素を先頭とするスキーマを構成する構成要素群はその第１の評価値（Ｓ）や第２の評価値（Ｄ）とともに示すと、例えば、｛ＴＩＤ５５（Ｓ０．８）、ＴＩＤ５６、ＴＩＤ５７（Ｓ１．０）、ＴＩＤ５８｝であったとする。このスキーマの評価点は、上記同様にして求めると、Ｓ０．８＋Ｓ１．０＝１．８となる。
【０２９０】
さらに、検索結果として得られた、テンプレートＩＤ「ＴＩＤ７２」の構成要素を先頭とするスキーマを構成する構成要素群はその第１の評価値（Ｓ）や第２の評価値（Ｄ）とともに示すと、｛ＴＩＤ７２（Ｓ１．０）、ＴＩＤ７３、ＴＩＤ７４、ＴＩＤ７５｝であったとする。このスキーマの評価点は、上記同様にして求めると、「１．０」となる。
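The scoring and ranking of steps S305 and S306 can be reproduced with the evaluation values quoted above. This is a sketch under assumptions: the data layout and the names `schema_score`, `w_s` and `w_d` are hypothetical, and both weights default to 1 as in the worked example.

```python
from typing import Dict, List

# First (S) and second (D) evaluation values of the elements in each schema,
# keyed by the template ID of the head element.
schemas: Dict[str, Dict[str, List[float]]] = {
    "TID4": {"S": [1.0, 1.0], "D": [1.72]},
    "TID55": {"S": [0.8, 1.0], "D": []},
    "TID72": {"S": [1.0], "D": []},
}

def schema_score(values: Dict[str, List[float]],
                 w_s: float = 1.0, w_d: float = 1.0) -> float:
    """Weighted sum of the first and second evaluation values."""
    return w_s * sum(values["S"]) + w_d * sum(values["D"])

# Highest score first, as in the output data of FIG. 52.
ranked = sorted(schemas, key=lambda tid: schema_score(schemas[tid]),
                reverse=True)
```

With unit weights the scores are 3.72, 1.8 and 1.0, so the schemas are listed in the order TID4, TID55, TID72. Raising `w_d` would favor schemas whose element values match the keywords over schemas whose element names match.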
【０２９１】
次に、図４３のステップＳ３０６に進み、出力データを作成する。
【０２９２】
ここでは、例えば、ステップＳ３０４において、検索結果として抽出された各スキーマのスキーマ文書に、上記ステップＳ３０５で求めた評価点を書き込み、このスキーマ文書を１つのＸＭＬ文書としてまとめて、例えば、図５２に示したような出力データを作成する。図５２に示したＸＭＬ文書は、評価点の高い物から順にスキーマ文書を並べて表示するためのものである。
【０２９３】
スキーマ検索処理部３０３で作成された、図５２に示したような出力データ（検索結果文書）は（必要に応じて、当該検索結果文書を表示するためのスタイルシートとともに）、結果処理部１２から、クライアント端末１０２へ送信される。
【０２９４】
クライアント端末１０２では、図５２に示した検索結果文書を受け取ると、当該検索結果文書とともに送られてきた（あるいは、予め定められた）スタイルシートを用いて、当該検索結果文書を表示するとともに、例えば、その中からユーザにより選択されたスキーマを、当該検索結果文書とともに送られてきた（あるいは、予め定められた）スタイルシートを用いて表示する。図５３は、検索結果として得られたスキーマの表示例を示したものであり、図５２に示した検索結果文書中の最初のスキーマ（テンプレートＩＤ「ＴＩＤ４」の構成要素を先頭とする文書構造）の表示例を示している。
【０２９５】
図５３に示す表示例では、そのまま入力フォームとなるように、入力領域が設けられている。
【０２９６】
なお、検索結果として得られたスキーマをもつＸＭＬ文書のＸＭＬデータベースの階層構造上の格納位置（構造化文書パス）は、テンプレートツリー上の当該スキーマの位置と同じであるから、例えば、図５３に示した入力フォームからデータを入力することにより作成された新規なＸＭＬ文書をＸＭＬデータベースに格納する場合、当該構造化文書パスを挿入コマンドや追加コマンドのパスの指定に用いればよい。よって、ユーザは、新規に作成したＸＭＬ文書の格納位置をどこにすればよいかと迷うことなく、当該新規なＸＭＬ文書をＸＭＬデータベース上の適切な格納位置に格納することができる。
【０２９７】
このようにして、上記文書構造の検索結果を用いて新規にＸＭＬ文書を作成した場合、内容的に類似する（同じカテゴリに属する）ＸＭＬ文書は、ほぼ同じ文書構造のものとなり（スキーマの氾濫を極力抑えることができ）、また、ＸＭＬデータベース上の格納位置もユーザがわざわざ指定することもないので、内容的に類似する（同じカテゴリに属する）ＸＭＬ文書は、ＸＭＬデータベース上で管理し易い格納位置にまとめられて記憶することができる（例えば、図４６において、先頭が「特許」という要素名の構成要素の文書であれば、それらは全て「特許ＤＢ」ノード以下に格納される）。
【０２９８】
次に、スキーマ検索のための検索条件の入力画面上に、ユーザは、検索したいスキーマとして、例えば、「ＸＭＬ関連のスキーマ」という、上記（条件１）、すなわち、要素名あるいは要素値あるいは属性値に含まれる文字列のみを指定した、検索条件を入力したとする。この場合は、キーワードレベルの漠然とした要求しかない場合である。
【０２９９】
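When such a search condition is entered, suppose that the client terminal 102, or the request reception unit 11 of the structured document management system, creates a search request (schema query) that searches for document structures containing the string “XML” in an element value.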
【０３００】
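This schema query is processed, in the same way as described above, by the schema search processing unit 303 of a structured document management system whose XML database is in the state shown in FIG. 46. In the where clause processing, vocabulary estimation extracts TID6, TID17 and TID58 as the template IDs of constituent elements whose element value contains “XML” or a string representing similar vocabulary (for example, “SGML”). Taking the similarity of “XML” as 1.0 and that of “SGML” as 0.64, the extracted constituent elements and their evaluation values are {TID6 (D1.64), TID17 (D1.0), TID58 (D1.0)}.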
【０３０１】
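Next comes the return clause processing, but in this case the user has specified no condition that would go into the return clause. The return clause of the schema query then contains, for example, “<xsp:return type="all"/>”. The schema search processing unit 303 therefore identifies the root nodes of documents in the template tree (as described above, the first constituent element of each XML document carries the attribute value “document” in the template tree, so root nodes can be identified by checking for it) and takes out of the template tree those document structures that are headed by a root node and contain any of the constituent elements extracted by the where clause processing (TID6, TID17, TID58).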
【０３０２】
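That is, processing proceeds to step S303 in FIG. 43. First, tracing upstream from the constituent element TID6 leads to TID4 (see FIG. 46), and the template tree description for this constituent element, as shown in FIG. 37, is checked. Since “<xsp:Node type="document">” is described for it, the document structure headed by TID4 becomes an extraction candidate; that is, TID4 is registered in the output candidate list.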
【０３０３】
In the same way, TID14 and TID55 are also added to the output candidate list.
【０３０４】
That is, at this point the following three document structures are candidates for extraction:
(1) {TID4, TID5, TID6, TID7, TID8, TID9, TID10, TID11}
(2) {TID14, TID15, TID16, TID17}
(3) {TID55, TID56, TID57, TID58}
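When the return clause asks for all schemas, the head of each candidate structure is found by walking upstream from a where-clause hit until a node marked as a document root is reached. The sketch below assumes a parent table and a set of document-root nodes consistent with the three structures listed above; the direct parents of TID17 and TID58 are guesses, and `PARENT`, `DOCUMENT_NODES` and `document_root` are hypothetical names.

```python
from typing import Dict, List, Optional, Set

# child -> parent in the template tree (partly assumed, see lead-in).
PARENT: Dict[str, str] = {
    "TID6": "TID4",
    "TID17": "TID14", "TID14": "TID13",
    "TID58": "TID55",
}

# Nodes whose template-tree entry carries type="document".
DOCUMENT_NODES: Set[str] = {"TID4", "TID14", "TID55"}

def document_root(tid: str) -> Optional[str]:
    """Walk upstream from tid to the nearest node marked as a document root."""
    node: Optional[str] = tid
    while node is not None:
        if node in DOCUMENT_NODES:
            return node
        node = PARENT.get(node)
    return None  # no enclosing document was found

where_hits: List[str] = ["TID6", "TID17", "TID58"]
roots = [document_root(tid) for tid in where_hits]
```

For the three where-clause hits this yields TID4, TID14 and TID55, the heads of the three candidate document structures.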
【０３０５】
以上３つの文書構造を構成する各構成要素のうち、テンプレートツリーの記述を基に、必須な構成要素のみを選択するなどして、抽出すべき構成要素群（スキーマ）を抽出する（図４３のステップＳ３０４）。
【０３０６】
さらに、この検索結果として得られた各スキーマに対し、評価点を求めて（図４３のステップＳ３０５）、出力データを作成する。
【０３０７】
なお、以上のようにして、検索されたスキーマについて、予めスキーマが設定されている場合（予め定められたスキーマに従って記述された文書のスキーマである場合）には、当該スキーマを記述したスキーマ文書をＸＭＬデータベースから検索して、それを検索結果のスキーマとして用いてもよい。
【０３０８】
また、ＸＱｕｅｒｙなどによって検索した結果に対して、データとともに付加的にスキーマ情報をクライアントに通知するようにしてもよい。ＸＱｕｅｒｙで検索した結果はデータだけになり、それらの構造が見えにくい部分があるが、それを補完することが可能となる。また、その結果情報として、テンプレート要素に対して、累積カウントを記述することで、それらをブラウジングすることで、データベース中のパスと、その頻度を一目で把握することも容易となる。
【０３０９】
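Access rights may also be set for each constituent element in the template tree. Suppose, for example, that there are departments A and B, and that the XML documents under each department's control are stored in a predetermined area of the XML database. Department B cannot access the documents under department A's control, but an access right may be set that discloses the schemas alone to department B. Disclosing the schemas means that, for example, when the applications of departments A and B are later linked, there are no schema differences to reconcile, which reduces cost.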
【０３１０】
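When storing various documents in an XML database, users often do not know where a given item should best be stored and simply put it in some provisional location, after which the information tends to get buried. Under XML database management, the documents are managed on the server side and represented as a single XML tree, so the system can also indicate where a document should be stored. Furthermore, the storage location need not be a folder; it may be a partial document. The proliferation of schemas can therefore be kept to a minimum.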
【０３１１】
以上のように、上記実施形態によれば、階層構造をもつＸＭＬデータベースから、構造、語彙が確定していない曖昧な条件から所望のスキーマおよび、そのスキーマを抽出したＸＭＬデータベース中の位置を、スキーマの明示的な定義の有無を問わず検索することが容易に可能となる。
【０３１２】
【発明の効果】
以上説明したように、本発明によれば、階層構造を有するＸＭＬデータベースから漠然とした検索条件を基に文書構造を効率よく検索することができる。
【図面の簡単な説明】
【図１】本発明の実施形態に係る構造化文書管理システムの構成例を示した図。
【図２】図１に示した構造化文書管理システムの一利用形態を示したもので、ＷＷＷのバックエンドで、構造化文書管理システムが動作している場合を示した図。
【図３】ＸＭＬで記述された構造化文書の一例を示した図。
【図４】図３の構造化文書の文書構造を模式的に示した図。
【図５】追加コマンドの機能を説明するための図で、構造化文書データベースの初期状態に追加コマンドを実行した場合について示している。
【図６】図５（ｂ）に示した状態の構造化文書データベースに対し、取得コマンドを実行した場合の処理結果を示した図。
【図７】図５（ｂ）に示した状態の構造化文書データベースに対し、追加コマンドを実行して１つの「特許」情報の文書オブジェクトツリーを追加した場合を示している。
【図８】図５（ｂ）に示した状態の構造化文書データベースに対し、追加コマンドを実行して３つの「特許」情報の文書オブジェクトツリーを追加した場合を示している。
【図９】要素名生起インデックスの格納例を示した図。
【図１０】データ生起インデックスの格納例を示した図。
【図１１】図８に示した状態の構造化文書データベースに対して、３つの「特許」情報を取り出すための取得コマンドを実行した場合の実行結果を示した図。
【図１２】ＸＭＬ文書の文書構造を定義するスキーマの一例を示した図。
【図１３】図８に示した状態の構造化文書データベースに、スキーマ格納コマンドを実行して、図１２に示したスキーマを追加格納（設定）した場合を示した図。
【図１４】スキーマが設定されて、スキーマが存在している旨の属性値のセットされた文書オブジェクトツリーを示した図。
【図１５】各オブジェクトファイルに、スキーマが存在している旨の属性値が格納されている様子を概念的に示した図。
【図１６】必要に応じて検索で使用される概念階層を構造化文書で表現した例を示した図。
【図１７】必要に応じて検索で使用される概念階層を構造化文書で表現した例を示した図。
【図１８】図８に示した状態の構造化文書データベースに対し、追加コマンドを実行して、図１６，図１７に示した「概念」情報の文書オブジェクトツリーを追加した場合を示した図。
【図１９】図８に示した状態の構造化文書データベースに対し、追加コマンドを実行して、図１６，図１７に示した「概念」情報の文書オブジェクトツリーを追加した場合を示した図。
【図２０】クエリ（ＸＭＬ文書）の一例を示した図。
【図２１】単純検索のクエリ（ＸＭＬ文書）の一例を示した図。
【図２２】図２１の単純検索のクエリを用いた検索結果（ＸＭＬ文書）を示した図。
【図２３】概念検索のクエリ（ＸＭＬ文書）の一例を示した図。
【図２４】図１の構造化文書管理システムの文書検索処理動作について説明するためのフローチャート。
【図２５】ユーザインタフェースとしての画面の表示例を示した図。
【図２６】文書検索を行うためのユーザインタフェースとしての画面の表示例を示した図。
【図２７】図２６に示した画面上から入力された情報に基づき作成されるクエリを示した図。
【図２８】図１の構造化文書管理システムの文書取得処理動作について説明するためのフローチャート。
【図２９】文書取得コマンドを実行した結果得られた構造化文書の表示例を示した図。
【図３０】文書検索を行うためのユーザインタフェースとしての画面の表示例であって、スキーマの検索処理動作を説明するための図。
【図３１】スキーマ検索のクエリの一例を示した図。
【図３２】スキーマの取得するためのユーザインタフェースとしての画面の表示例を示したもので、取得されたスキーマの表示例を示している。
【図３３】テンプレート処理部の構成例を示した図。
【図３４】テンプレートツリーの一例を示した図。
【図３５】ＸＭＬ文書の具体例を示した図。
【図３６】図３４に示したテンプレートツリーの更新後を示した図。
【図３７】テンプレートツリーの記憶例を示した図。
【図３８】スキーマ検索処理部３０３の構成例を示した図。
【図３９】スキーマクエリの一例を示した図。
【図４０】スキーマクエリの他の例を示した図。
【図４１】意味ネットワークの具体例を示した図。
【図４２】意味ネットワークを用いて検索条件として指定されたキーワードの類似語を求める処理を説明するためのフローチャート。
【図４３】スキーマ検索処理部の処理動作を説明するためのフローチャート。
【図４４】スキーマ検索のための検索条件の入力画面の表示例を示した図。
【図４５】図４４に示した入力画面から入力された検索条件に基づき作成されたスキーマクエリの一例を示した図。
【図４６】テンプレートツリーの具体例を示した図。
【図４７】検索した結果得られた文書構造を示した図。
【図４８】図４７に示した文書構造を記述したＸＭＬ文書の一例を示した図。
【図４９】検索した結果得られた他の文書構造を示した図。
【図５０】図４９に示した文書構造を記述したＸＭＬ文書の一例を示した図。
【図５１】検索した結果得られたさらに他の文書構造を示した図。
【図５２】出力データの一例を示した図。
【図５３】検索結果として得られたスキーマの表示例を示した図。
【図５４】スキーマクエリ中のｗｈｅｒｅ節の処理手順を説明するためのフローチャート。
【図５５】構造推定処理を説明するためのフローチャート。
【図５６】語彙推定処理を説明するためのフローチャート。
【図５７】スキーマクエリ中のｒｅｔｕｒｎ節の処理手順を説明するためのフローチャート。
【符号の説明】
１００…構造化文書管理システム
３０３…スキーマ検索処理部
３０６…テンプレート処理部
３０６ａ…構造抽出部
３０６ｂ…テンプレートツリー処理部
３２１…スキーマクエリ解析部
３２２…スキーマクエリ条件処理部
３２３…スキーマクエリ出力処理部
４０４…テンプレート記憶部
４０５…意味ネットワーク記憶部
[0001]
[Technical Field]
The present invention relates to a document structure search method and a structured document search apparatus for searching for a desired document structure from different document structures of a plurality of structured documents stored in a database having a hierarchical structure.
[0002]
[Prior art]
Some databases that store structured documents, such as documents described in XML (Extensible Markup Language), have a hierarchical structure and store and manage a plurality of structured documents with different document structures as partial documents of one large tree structure.
[0003]
The document structure of a structured document such as a document described in XML (hereinafter also called an XML document or XML data) is called a schema. The data creator can define a schema explicitly using a schema language such as XML Schema or RELAX NG (a description of a schema in a schema language is called a schema document, and is itself a structured document), but describing a schema in a schema language is generally laborious. Furthermore, unlike SGML, XML does not require that a schema be defined for the data in a schema language; doing so is strictly optional.
[0004]
Unlike the hierarchical XML database described above, there are also XML databases with a flat (non-hierarchical) data structure that store each XML document in a folder. In that case, schemas are evidently stored per folder and are not managed hierarchically.
[0005]
In the case of an XML database that has a hierarchical structure and manages XML documents in one large tree structure (hereinafter referred to as a hierarchical XML database), the document structure of each structured document is managed as part of the hierarchical structure.
[0006]
Although it is not always necessary to determine the schema in advance, if all the XML documents stored in the hierarchical XML database have different document structures, it cannot be denied that they are difficult to manage. From the viewpoint of efficiently managing stored documents, it is desirable that XML documents that are similar in content (belonging to the same category) have the same document structure rather than completely different document structures.
[0007]
If XML document creators each create XML documents with their own unrelated document structures, document structures proliferate, the hierarchical XML database can be used for nothing more than a simple search engine, and the merits of XML may be lost.
[0008]
In other words, in a hierarchical XML database it is desirable to suppress the proliferation of schemas as far as possible. To this end, one can, before creating an XML document, search the hierarchical XML database for the schema of XML documents with similar content and create the data using that schema.
[0009]
However, the conventional hierarchical XML database has the following two problems.
[0010]
The first problem is that a hierarchical XML database offers no effective means of searching only for a desired document structure. If a schema were defined in a schema language for every stored document, schema search would be easy, because the schema documents could be managed in the database. However, defining a schema in a schema language is optional, so such a search would find only the schemas that happen to have been registered.
[0011]
Another conceivable method is to search the data with XQuery or the like while specifying an ambiguous structure, and have the user examine the resulting XML data and extract the structure by eye. However, this is also costly, because what the user wants is the schema, not the data.
[0012]
For example, there is a technique in which an arbitrary element name is designated as a search condition and the DTD (Document Type Definition) that the matching document conforms to is returned as the search result (see, for example, Patent Document 1). With this method, however, no information can be obtained for a document that does not conform to a DTD. Moreover, the granularity of the result is the whole document, so it is not possible to return only some components of the DTD (that is, only the schema of a partial document).
[0013]
There is also a technique for searching for documents similar to a seed document and displaying them (see, for example, Patent Document 2). With the method disclosed there, the user must determine by eye which schema each retrieved document follows, and since the search result is again in units of documents, problems remain, such as low similarity when one wants to extract the schema contained in a partial document.
[0014]
Furthermore, there is a technique for managing newly registered element names and attribute names in a table (see, for example, Patent Document 3). However, this method is effective only with a database that manages XML documents in a flat manner, and does not consider a hierarchical XML database in which the document structure is managed hierarchically.
[0015]
The second problem is that when a user tries to search for a document structure, both how the search should be performed and what result should be obtained are often ambiguous and vague, yet there is no way of specifying search conditions for a document structure that allows for such ambiguity.
[0016]
As to how the search is performed: for example, when a schema for “book” information is wanted, one would normally search for a schema that defines an element named “book”. However, the element may not be defined under exactly that name, and the user may want the search to include schemas that define it under another name with the same meaning.
[0017]
There are also requests to search for a schema vaguely, from keywords alone. In this case it is not even determined whether the keyword denotes an element, an attribute, or a value; the intention is simply to find whatever schemas store data related to it.
[0018]
Thus, at the time of a schema search the user's requirement is vague and not fixed, and the search mechanism should be able to absorb that vagueness.
[0019]
In a conventional search method (see, for example, Patent Document 2), nothing is returned unless the tag name matches exactly. If a DTD has been specified in advance, the structure according to that DTD is shown, but this differs from searching for schemas, which is the point of the present invention.
[0020]
Whatever the result, information that the user will want to use should be attached to it, rather than simply returning a description in a schema language.
[0021]
Even if the schema is not defined by the schema language, when the user wants to manage data as an XML document, the user implicitly assumes the schema and creates a document accordingly. Furthermore, there is a strong tendency for data having similar schemas to be managed in the same path in a hierarchical database. These can be regarded as implicit schema definitions, and a management method is required in the same way as explicit schema definitions in the schema language.
[0022]
In addition, there are mainly two types of use of schema search (document structure search) depending on the user's intention.
[0023]
The first is the case where the schema to be found can be predicted to some extent. For example, when searching for a “patent-related schema”, the search looks for schemas similar to the schema that defines XML documents of “patent” information.
[0024]
The second is the case where data exists but its structure, that is, how it should be organized, has not yet been decided, and the user wants to find a similar schema. The main point here is how to present the user with a schema that is useful for the data to be stored.
[0025]
Even in the search result, it is assumed that the database is a hierarchical database and the schemas are also arranged hierarchically.
[0026]
[Patent Document 1]
JP 2000-200266 A
[0027]
[Patent Document 2]
JP 2001-14326 A
[0028]
[Patent Document 3]
JP 2002-7439 A
[0029]
[Problems to be solved by the invention]
Thus, conventionally, there has been no method for efficiently searching only the document structure from the XML database having a hierarchical structure while taking into account the ambiguity of the search condition.
[0030]
Accordingly, an object of the present invention is to provide a document structure search method, a document structure search apparatus, and a program that can efficiently search a document structure from an XML database having a hierarchical structure based on a vague search condition.
[0031]
[Means for Solving the Problems]
The present invention retrieves a desired document structure from among the different document structures of a plurality of structured documents stored in a database having a hierarchical structure, the hierarchical structure including the constituent elements that make up each of the different document structures. When at least one keyword is input as a search condition for searching the document structure, similar words are obtained for each keyword, constituent elements that include either the keyword or one of its similar words in an element name or element value are searched for in the hierarchical structure, and a document structure including the retrieved constituent elements is output as a search result.
[0032]
According to the present invention, simply by inputting at least one keyword, one can efficiently obtain a document structure that includes a constituent element whose element name or element value is the keyword or one of its similar words.
[0033]
The present invention also retrieves a desired document structure from among the different document structures of a plurality of structured documents stored in a database having a hierarchical structure, the hierarchical structure including the constituent elements that make up each of the different document structures. When at least one first keyword and at least one second keyword are input as a search condition for searching the document structure, similar words of the first keyword are obtained, and a first constituent element that includes either the first keyword or one of its similar words in an element name or element value is searched for in the hierarchical structure. Similar words of the second keyword are likewise obtained, and a second constituent element that includes either the second keyword or one of its similar words in an element name is searched for in the hierarchical structure. Among the document structures headed by the second constituent element, at least a document structure that includes the first constituent element is output as a search result.
[0034]
According to the present invention, simply by inputting at least one first keyword and at least one second keyword, one can efficiently obtain a document structure that is headed by a constituent element whose element name is the second keyword or one of its similar words and that includes a constituent element whose element name or element value is the first keyword or one of its similar words.
[0035]
[Detailed Description of the Invention]
First, before describing an embodiment of the present invention, a basic configuration and processing operation of a structured document management system will be described.
[0036]
(Description of structured document management system)
Examples of structured documents include documents described in XML, SGML, and the like. SGML (Standard Generalized Markup Language) is a standard defined by ISO (International Organization for Standardization). XML (Extensible Markup Language) is a standard defined by W3C (World Wide Web Consortium). Both are standards that allow documents to be structured.
[0037]
The following description takes a document described in XML as the example of a structured document. Data that defines the document structure of a structured document (document structure definition data) is called a schema. For XML, schema languages such as XML Schema and XDR (XML-Data Reduced) have been proposed for defining schemas. Here, the case where a schema is described in XDR is used as an example.
[0038]
The schema is also a structured document to be managed by the structured document management system, and may be referred to as a schema document here. A structured document other than a schema document, which has various miscellaneous contents such as patent specifications, mails, weekly reports, and advertisements, may be referred to herein as a content document.
[0039]
In addition to schema documents and content documents, the structured document management system also manages queries describing search requests from users, described later, that is, query documents. These are collectively referred to as “documents”.
[0040]
Hereinafter, when there is no special notice, when referring to “document”, it means all content documents, schema documents, and query documents.
[0041]
First, XML will be briefly described before describing the embodiment.
[0042]
FIG. 3 shows an example of “patent” information as an example of a structured document described in XML. In XML and SGML, tags are used to express the structure of a document. There are start tags and end tags, and each component of the document structure is enclosed between a start tag and an end tag. A start tag is the element name of the component enclosed in “<” and “>”, and an end tag is the element name enclosed in “</” and “>”. The content of the component following the start tag is text (a character string) or a sequence of child components. Attribute information can be set in the start tag, as in “<element-name attribute="attribute value">”. A component that contains no text, such as “<patent DB></patent DB>”, can also be written in the abbreviated form “<patent DB/>”.
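The tag conventions just described can be tried out with Python's standard `xml.etree.ElementTree` module. The sample document is an assumption made for illustration: the element names are English renderings of those in FIG. 3, `patentDB` is written without a space so that it is a legal XML name, and the `lang` attribute is invented to show attribute syntax.

```python
import xml.etree.ElementTree as ET

# A small document in the style of FIG. 3 (names and attribute are assumed).
doc = (
    '<patent>'
    '<title lang="en">XML database</title>'
    '<applicant>T company</applicant>'
    '</patent>'
)

root = ET.fromstring(doc)
title = root.find("title")                    # first child named "title"
child_tags = [child.tag for child in root]    # element names in document order

# An element with no content can be written in long or abbreviated form.
long_form = ET.fromstring("<patentDB></patentDB>")
short_form = ET.fromstring("<patentDB/>")
same_empty = ET.tostring(long_form) == ET.tostring(short_form)
```

Here `title.text` holds the element value, `title.get("lang")` returns the attribute value, and `same_empty` confirms that both spellings of the empty element parse to the same thing.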
[0043]
The document shown in FIG. 3 has an element starting from a “patent” tag as a root and elements starting from “title”, “application date”, “applicant”, and “summary” tags as its child elements. For example, one text (character string) such as “XML database” exists as an element value for an element starting from a “title” tag.
[0044]
In general, a structured document such as XML repeatedly includes arbitrary constituent elements, and further, the document structure is not predetermined.
[0045]
To express a structured document such as that of FIG. 3 logically, a tree representation as shown in FIG. 4 is used. The tree is composed of nodes (numbered and drawn as circles), arcs (labeled lines connecting the circles that represent the nodes), and text enclosed in rectangles.
[0046]
One node corresponds to one component, that is, one document object. From each node extend arcs labeled with tag names and attribute names. An arc leads either to a node or to a character string (text) serving as an element value. The alphanumeric strings written in the nodes (for example, “#0”, “#49”) are object IDs that identify each document object.
[0047]
The tree structure shown in FIG. 4 is called the document object tree of the structured document shown in FIG.
[0048]
FIG. 1 shows an example of the configuration of a structured document management system according to this embodiment. In FIG. 1, the structured document management system is roughly composed of a request control unit 1, an access request processing unit 2, a search request processing unit 3, a data access unit 4, a document storage unit 5, and an index storage unit 6. The document storage unit 5 and the index storage unit 6 are composed of, for example, an external storage device.
[0049]
The system configuration of FIG. 1 can be realized using software.
[0050]
The request control unit 1 includes a request reception unit 11 and a result processing unit 12. The request reception unit 11 receives a request from the user such as document storage, document acquisition, and document search, and calls the access request processing unit 2. The result processing unit 12 performs processing for returning the result processed by the access request processing unit 2 to the requesting user.
[0051]
The access request processing unit 2 includes a plurality of processing units corresponding to the various requests from the user, such as document storage, document acquisition, and document deletion. That is, it is composed of the document storage unit 21, the document acquisition unit 22, and the document deletion unit 23.
[0052]
The document storage unit 21 performs processing for storing a document in a specified logical area in the document storage unit 5.
[0053]
When a logical area in the document storage unit 5 is specified, the document acquisition unit 22 performs processing for acquiring a document existing in the specified area.
[0054]
The document deletion unit 23 performs a process of deleting a document existing in a designated logical area in the document storage unit 5.
[0055]
The document storage unit 5 is a structured document database. For example, as shown in FIG. 8, documents are hierarchically stored in a tree structure like a UNIX directory structure.
[0056]
As shown in FIG. 8, the structured document database can be expressed in the same manner as the tree structure of one structured document as shown in FIG. That is, a partial hierarchical tree (partial tree) below an arbitrary node is a structured document cut out from the structured document database, and here, this is called a document object tree. Each node is assigned an object ID. The object ID is a unique numerical value in the structured document database.
[0057]
It is assumed that an object ID “# 0” for specifying that the node is the root node is assigned to the node serving as the root of the hierarchical tree.
[0058]
A link is made from the root node, that is, the node of “# 0” to the node of the object ID “# 1” having the “root” tag at the head. A link from an “# 1” node to a node with an object ID “# 2” having a “patent DB” tag at the head is provided. From the “# 2” node, links to a node with an object ID “# 42”, a node with “# 52”, and a node with “# 62” having a “patent” tag at the head are provided.
[0059]
The “patent” information shown in FIG. 3 corresponds to the partial tree below the “# 42” node in FIG. 8. From this node, links extend to nodes having a “title” tag, an “applicant” tag, a “summary” tag, and so on at the head, and from the terminal nodes, links extend to character strings (element values) such as “XML database”, “T company”, and “Provide a unified XML database”.
[0060]
In FIG. 8, the partial tree below the node with the object ID “# 52” and the partial tree below the node with the object ID “# 62” are also document object trees, each corresponding to one piece of “patent” information.
[0061]
By the way, for example, the element value “XML database” linked to the “# 43” node is connected to the “# 43” node by a special tag name “#value”. Since this tag name starts with “#”, it cannot be used as a standard tag name in the XML standard.
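As a rough illustration of the tree described above, the following Python sketch models document objects with object IDs and labeled arcs. The class name `Node`, the helpers `add` and `value_of`, and the ASCII tag name `patentDB` (standing in for “patent DB”) are assumptions made for this example; that the “# 43” node is reached by a “title” arc is inferred from the text rather than stated in it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One document object; arcs are (label, target) pairs in document order."""
    oid: int
    arcs: list = field(default_factory=list)

    def add(self, label, target):
        """Attach `target` (a Node or a string) via an arc called `label`."""
        self.arcs.append((label, target))
        return target


# Fragment of the tree of FIG. 8:
# #0 -root-> #1 -patentDB-> #2 -patent-> #42 -title-> #43
root = Node(0)
n1 = root.add("root", Node(1))
n2 = n1.add("patentDB", Node(2))
n42 = n2.add("patent", Node(42))
n43 = n42.add("title", Node(43))
# An element value hangs off the reserved "#value" arc,
# which cannot collide with a legal XML tag name.
n43.add("#value", "XML database")


def value_of(node):
    """Return the text reached via the "#value" arc, or None if there is none."""
    for label, target in node.arcs:
        if label == "#value":
            return target
    return None
```

Keeping the element value behind a reserved arc name means that tag arcs and value arcs can share one list without ambiguity.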
[0062]
A structured document path is used to designate a specific node in such a structured document database. A structured document path is a character string that starts with “uix: // root”. Here, uix (Universal Identifier for XML) is the character string indicating that what follows is a structured document path.
[0063]
For example, when the structured document path is “uix: // root / patent DB”, the logical area in the document storage unit 5 indicated by this path is the node reached from the “# 1” node in FIG. 8 via the arc labeled “patent DB”, that is, the “# 2” node.
[0064]
Similarly, the structured document path “uix: // root / patent DB / patent” points to the “# 42” node in FIG. 8, and the structured document path “uix: // root / patent DB / application date / year” indicates the “# 45” node in FIG. 8.
[0065]
For example, in FIG. 8, when a plurality of pieces of “patent” information are stored below the “# 2” node, that is, in the component “patent DB”, an index may be added to the name (in this case, “patent”) in order to identify each piece of “patent” information.
[0066]
The first “patent” information of “patent DB” is “uix: // root / patent DB / patent [0]”, which is regarded as the same as “uix: // root / patent DB / patent”. The second “patent” information of “patent DB” is “uix: // root / patent DB / patent [1]”, and the fifth is “uix: // root / patent DB / patent [4]”.
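The path rules above (zero-based indices, with an omitted index meaning the first match) can be sketched in Python as follows. The dictionary-based tree, the function names `node` and `resolve`, and the compact spelling “uix://root/patentDB” (no spaces, ASCII tag name) are assumptions made for this illustration, not the notation of the embodiment.

```python
import re


def node(oid, *children):
    """Toy document object: an ID plus ordered (tag, child) pairs."""
    return {"id": oid, "children": list(children)}


db = node(0, ("root", node(1, ("patentDB", node(
    2, ("patent", node(42)), ("patent", node(52)), ("patent", node(62)))))))

# One path step: a name optionally followed by a bracketed index.
STEP = re.compile(r"^(?P<name>[^\[\]]+?)(?:\[(?P<idx>\d+)\])?$")


def resolve(tree, path):
    """Resolve a 'uix://a/b[n]' path to an object ID; no index means [0]."""
    if not path.startswith("uix://"):
        raise ValueError("not a structured document path")
    current = tree
    for step in path[len("uix://"):].split("/"):
        m = STEP.match(step.strip())
        if m is None:
            raise ValueError(f"bad step: {step!r}")
        name = m.group("name").strip()
        idx = int(m.group("idx") or 0)
        matches = [child for tag, child in current["children"] if tag == name]
        if idx >= len(matches):
            raise KeyError(path)  # no such node
        current = matches[idx]
    return current["id"]
```

With this toy data, `resolve(db, "uix://root/patentDB/patent")` and `resolve(db, "uix://root/patentDB/patent[0]")` both reach the “# 42” node, mirroring the equivalence stated above.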
[0067]
The index storage unit 6 stores an element name occurrence index and a data occurrence index used at the time of retrieval.
[0068]
The element name occurrence index is an index file that associates each element name stored in the structured document database with the position of the structured documents (document object trees) having a component with that element name at the head. For example, in the structured document database of FIG. 8, the structured documents whose head element name is “patent” (corresponding to “patent” information) are the structured document below the “# 42” node, the one below the “# 52” node, and the one below the “# 62” node. In this case, as shown in FIG. 9, the element name occurrence index stores the parent node of the “# 42”, “# 52”, and “# 62” nodes, that is, the “# 2” node, linked to the element name “patent”.
[0069]
Thus, when indexing is performed at the parent node, the index file can be compressed. In other words, if indexing is performed at the parent node, the number of nodes to be linked to the element name does not increase because the parent node substitutes even if the number of child nodes increases.
[0070]
A data occurrence index is an index file that associates character string data stored in a structured document database with the position of a structured document (document object tree) in which the character string data exists. For example, in the structured document database of FIG. 8, the character string “XML” exists in the structured document under the “# 43” node and the structured document under the “# 49” node. In this case, in the data occurrence index, as shown in FIG. 10, the “# 43” node and the “# 49” node are linked to the character string “XML” and stored.
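The two indexes can be pictured with a small Python sketch. The row layout, the variable names, the whitespace-based word splitting, and the text attached to the “# 49” node are assumptions for illustration; only the linkage of “patent” to “# 2” and of “XML” to “# 43” and “# 49” follows the description.

```python
from collections import defaultdict

# (object ID, parent ID, tag, text) rows for a fragment of FIG. 8.
rows = [
    (2, 1, "patentDB", None),
    (42, 2, "patent", None),
    (52, 2, "patent", None),
    (62, 2, "patent", None),
    (43, 42, "title", "XML database"),
    (49, 42, "summary", "Provide a unified XML database"),
]

element_index = defaultdict(set)  # element name -> parent node IDs
data_index = defaultdict(set)     # word -> IDs of nodes whose text contains it

for oid, parent, tag, text in rows:
    # Indexing at the parent: three "patent" children cost one entry (#2).
    element_index[tag].add(parent)
    for word in (text or "").split():
        data_index[word].add(oid)
```

Because the element name index records the parent, adding a fourth “patent” child under “# 2” would leave `element_index["patent"]` unchanged, which is the compression effect described in paragraph [0069].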
[0071]
The designated logical area in the document storage unit 5 is a storage location of the document designated by the user using the structured document path. The structured document path is an expression that can be recognized by the user.
[0072]
Returning to the description of FIG. 1.
[0073]
The data access unit 4 performs various processes for accessing the document storage unit 5. The data access unit 4 is composed of a document object tree storage unit 41, a document object tree deletion unit 42, a document object tree acquisition unit 43, a document character string acquisition unit 44, a document parser unit 46, a composite document creation unit 47, and an index update unit 48.
[0074]
The document object tree storage unit 41 performs processing for storing a document object tree in a designated physical area in the document storage unit 5.
[0075]
The document object tree deletion unit 42 performs processing for deleting the document object tree existing in the designated physical area in the document storage unit 5.
[0076]
The document object tree acquisition unit 43 performs a process for acquiring a document object tree existing in a designated physical area (by a structured document path or the like) in the document storage unit 5.
[0077]
The document character string acquisition unit 44 performs processing for converting the document object tree into a structured document (XML document).
[0078]
The document parser unit 46 reads a structured document input by a user and inspects its document structure. Further, if there is a schema, that is, definition data of the document structure, it verifies whether or not the document structure of the input structured document conforms to the schema. The output is a document object tree. A document parser can usually be constructed by combining a lexical analyzer generator such as lex, which performs lexical analysis and decomposes the input into tokens, with a parser generator such as yacc (Yet Another Compiler Compiler).
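The following Python sketch shows one way a parser could turn an XML string into a document object tree. It swaps in Python's standard-library `xml.etree.ElementTree` parser in place of a lex/yacc-built parser, and it omits schema verification entirely. The function name `to_object_tree`, the tuple layout, and the ASCII tag names are assumptions for this example.

```python
import itertools
import xml.etree.ElementTree as ET


def to_object_tree(xml_text, ids):
    """Parse XML and return (root tag, tree); tree is nested (oid, arcs) tuples.

    `ids` is an iterator that supplies object IDs in document order.
    """
    def convert(elem):
        oid = next(ids)
        # Attributes become "@name" arcs, text becomes a "#value" arc.
        arcs = [("@" + k, v) for k, v in elem.attrib.items()]
        if elem.text and elem.text.strip():
            arcs.append(("#value", elem.text.strip()))
        arcs.extend((child.tag, convert(child)) for child in elem)
        return (oid, arcs)

    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed input
    return (root.tag, convert(root))


tag, tree = to_object_tree(
    "<patent><title>XML database</title><date><year>1999</year></date></patent>",
    itertools.count(42),
)
```

Numbering in document order from 42 gives the toy “year” element the ID 45. This is a property of this small input only, not a claim about how the embodiment assigns object IDs.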
[0079]
The composite document creation unit 47 creates the data needed to check conformance to the schema, a check that must be performed whenever a document is stored or deleted.
[0080]
The index updating unit 48 updates the element name occurrence index and the data occurrence index shown in FIGS. 9 and 10 every time the stored contents of the structured document database are updated due to document storage or document deletion.
[0081]
The physical area in the document storage unit 5 is internal data indicating the location of unique document data in the structured document database such as file offset and object ID. The data cannot be recognized by the user.
[0082]
The search request processing unit 3 performs processing for searching for a document stored in the document storage unit 5 using each processing function unit provided in the data access unit 4. When the request receiving unit 11 of the request control unit 1 receives a document search request from the user, the search request processing unit 3 receives a query document described in the query language from the request receiving unit 11. Then, the index storage unit 6 and the document storage unit 5 are accessed through the data access unit 4, a set of documents that match the search request is acquired, and the result is output via the result processing unit 12.
[0083]
FIG. 2 shows one usage form of the structured document management system shown in FIG. 1. FIG. 2 shows the case where the structured document management system 100 having the configuration shown in FIG. 1 operates on the back end of the WWW (World Wide Web).
[0084]
The WWW browser 103 is operating in each of a plurality (for example, three) of client terminals (for example, personal computers, portable communication terminals, etc.) 102. The user can access the structured document management system 100 by accessing the WWW server 101 from each client terminal. The WWW browser 103 and the WWW server 101 communicate with each other using HTTP (Hyper Text Transfer Protocol). Further, the WWW server 101 and the structured document management system 100 communicate with each other via CGI (Common Gateway Interface) or COM (Component Object Model).
[0085]
Requests from users, such as document storage, document acquisition, and document search, are transmitted from the WWW browser 103 and accepted by the structured document management system 100 through the WWW server 101. The result processed by the structured document management system 100 is returned to the requesting WWW browser 103 through the WWW server 101.
[0086]
The (1) storage function and (2) search function of the structured document management system in FIG. 1 will be described in detail below.
[0087]
(Storage function)
The storage commands in the structured document management system of FIG. 1 are as follows.
[0088]
insertXML (path, Nth, XML): document storage
appendXML (path, XML): document storage
getXML (path): Document acquisition
removeXML (path): Delete document
setSchema (path, schema): Schema storage
getSchema (path): Schema acquisition
“insertXML” is a command (hereinafter simply referred to as an insert command) that inserts a document at the Nth position below the structured document path specified in ().
[0089]
“appendXML” is a command (hereinafter simply referred to as an add command) that inserts a document at the end below the structured document path specified in ().
[0090]
“getXML” is a command (hereinafter simply referred to as an acquisition command) for retrieving a document below the structured document path specified in ().
[0091]
“removeXML” is a command (hereinafter simply referred to as a delete command) for deleting a document below the structured document path specified in () (a document other than a schema document, mainly a content document).
[0092]
“setSchema” is a command (hereinafter simply referred to as a schema storage command) for setting a schema in the structured document path specified in ().
[0093]
“getSchema” is a command (hereinafter simply referred to as a schema acquisition command) for retrieving the schema set in the structured document path specified in ().
[0094]
Among the above commands, the insert command, the add command, and the schema storage command are processed by the document storage unit 21 of the access request processing unit 2; the acquisition command and the schema acquisition command are processed by the document acquisition unit 22; and the delete command is processed by the document deletion unit 23.
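The assignment of commands to processing units can be summarized as a lookup table. The Python sketch below is only a restatement of the paragraph above; the names `DISPATCH` and `route` and the error behavior for unknown commands are assumptions.

```python
# Command name -> processing unit inside the access request processing unit 2.
DISPATCH = {
    "insertXML": "document storage unit 21",
    "appendXML": "document storage unit 21",
    "setSchema": "document storage unit 21",
    "getXML": "document acquisition unit 22",
    "getSchema": "document acquisition unit 22",
    "removeXML": "document deletion unit 23",
}


def route(command):
    """Return the unit that handles `command`; reject unknown commands."""
    try:
        return DISPATCH[command]
    except KeyError:
        raise ValueError(f"unknown storage command: {command}") from None
```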
[0095]
With reference to FIG. 5, the case where an add command is executed in the initial state of the structured document database (see FIG. 5A) will be described.
[0096]
As shown in FIG. 5A, when “appendXML (“ uix: // root ”,“ <patent DB /> ”)” is executed in the initial state in which the “# 0” node and the “# 1” node are connected by the “root” arc, a “# 2” node and a “patent DB” arc are created as shown in FIG. 5B.
[0097]
Next, a case where an acquisition command is executed on the structured document database in the state shown in FIG. 5B will be described.
[0098]
For example, when “getXML (“ uix: // root ”)” is executed, the document object tree below the “# 0” node indicated by the “root” arc in FIG. 5B is extracted and converted into an XML document. As a result, the character string “<root><patent DB /></root>” is obtained as shown in FIG. The processing of the acquisition command is executed by the document acquisition unit 22 of the access request processing unit 2.
[0099]
Next, the case will be described where an add command for storing “patent” information as a content document (XML document) as shown in FIG. 3 is executed on the structured document database in the state shown in FIG. 5B. In this case, “appendXML (“ uix: // root / patent DB ”,“ <patent>... </patent> ”)” is executed. In this command, “<patent>... </patent>” corresponds to the XML document of the “patent” information shown in FIG. 3.
[0100]
When the add command is processed, as shown in FIG. 7, a document object tree (corresponding to FIG. 4) with the “# 42” node at the top is added below the “# 2” node.
[0101]
Assume that the following additional command is repeatedly executed three times for the structured document database in the state shown in FIG.
[0102]
“appendXML (“ uix: // root / patent DB ”,“ <patent> ... </patent> ”)”
In the above command, “<patent> ... </patent>” corresponds to a content document having the same document structure as the XML document shown in FIG. 3.
[0103]
Then, as shown in FIG. 8, a document object tree with “# 42” node, “# 52” node, and “# 62” node at the top is added below the “# 2” node.
[0104]
Next, a case where an acquisition command for extracting the three pieces of “patent” information is executed on the structured document database in the state shown in FIG. 8 will be described. In this case, “getXML (“ uix: // root / patent DB ”)” is executed. Then, the document object tree below the “# 2” node indicated by the “patent DB” arc is extracted. As a result, as shown in FIG. 11, the XML document “<patent DB><patent> ... </patent><patent> ... </patent><patent> ... </patent></patent DB>” can be acquired.
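The sequence of add and acquisition commands walked through above can be imitated with a toy in-memory store. This is a minimal Python sketch, not the embodiment: the class `ToyStore`, its use of `xml.etree.ElementTree`, the compact path spelling, and the ASCII tag name `patentDB` are all assumptions, and indices such as `patent[1]` are not handled.

```python
import xml.etree.ElementTree as ET


class ToyStore:
    """In-memory stand-in for the structured document database."""

    def __init__(self):
        self.root = ET.Element("root")  # initial state: only the "root" arc

    def _find(self, path):
        """Follow the path step by step, taking the first child with each tag."""
        if not path.startswith("uix://"):
            raise KeyError(path)
        steps = path[len("uix://"):].split("/")
        if steps[0] != "root":
            raise KeyError(path)
        node = self.root
        for step in steps[1:]:
            node = node.find(step)
            if node is None:
                raise KeyError(path)
        return node

    def appendXML(self, path, xml):
        """Append the parsed document as the last child of the node at `path`."""
        self._find(path).append(ET.fromstring(xml))

    def getXML(self, path):
        """Serialize the subtree at `path` back into an XML string."""
        return ET.tostring(self._find(path), encoding="unicode")


store = ToyStore()
store.appendXML("uix://root", "<patentDB/>")
after_first_append = store.getXML("uix://root")
for _ in range(3):
    store.appendXML("uix://root/patentDB", "<patent><title>t</title></patent>")
all_patents = store.getXML("uix://root/patentDB")
```

After the loop, `all_patents` holds one `patentDB` element wrapping three `patent` elements, which corresponds in shape to the document of FIG. 11.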
[0105]
In the structured document database, data defining the document structure of the content document (XML document) such as the “patent” information, that is, the schema, is also managed.
[0106]
FIG. 12 shows an example of a schema that defines the document structure of an XML document. Here, XDR (XML-Data Reduced), which is one of XML document structure definition languages, will be taken up. Of course, other document structure definition languages such as XML-Schema may be used.
[0107]
The schema shown in FIG. 12 defines the document structure of the “patent” information shown in FIG. 3 in XDR. As can be easily seen from FIG. 12, the schema is also a structured document in the XML format. It begins with a component headed by a “Schema” tag, whose child elements are a set of elements each headed by an “ElementType” tag.
[0108]
A case where a schema storage command for storing the schema document shown in FIG. 12 is executed on the structured document database in the state shown in FIG. 8 will be described. In this case, “setSchema (“ uix: // root / patent DB ”,“ <Schema>... </Schema> ”)” is executed. In this command, “<Schema>... </Schema>” corresponds to the schema document shown in FIG. 12.
[0109]
By executing the above command, as shown in FIG. 13, a “#schema” arc is added below the “# 2” node, and a document object tree having the “# 3” node as its top node is added at the destination of that arc. Since the schema itself is expressed as an XML document, it is expanded into a tree as shown in FIG. 13.
[0110]
In FIG. 13, arcs that start with “@”, like “@name”, correspond to attributes. Since tag names such as “#schema” and “@name” begin with “#” or “@”, they cannot be used as standard tag names in the XML standard.
[0111]
Since the schema document shown in FIG. 12 is stored below the “# 2” node, documents stored below the “# 2” node may be required to have a document structure that conforms to the one defined by the schema document shown in FIG. 12. That is, in this case, the schema shown in FIG. 12 is set below the “# 2” node.
[0112]
When the schema shown in FIG. 12 is set below the “# 2” node, an attribute value indicating that the schema exists is set in each node (file) of the document object tree below the “# 2” node, for example, as shown in FIG. 14.
[0113]
After the schema shown in FIG. 12 is set below the “# 2” node, when the “patent” information document shown in FIG. 3, which matches the document structure defined in this schema, is stored in the structured document database as a document object tree as shown in FIG. and as described above, an attribute value indicating that the schema shown in FIG. 12 exists for the document structure of this document is set in each document object constituting the document object tree. For example, “1” is set as the attribute value indicating that a schema exists (for example, “schema conformity presence / absence”) in the file of each document object constituting the document object tree. In FIG. 14, each document object (node) conforming to the schema is indicated by a double circle. Each document object indicated by a double circle has a document structure definition corresponding to that document object.
[0114]
FIG. 15 conceptually shows the contents of the file of each document object. For example, in the file of the document object whose object ID is “# 42”, the attribute value is described together with information on the other document objects linked to that document object (for example, arcs or pointer values to the linked document objects). Note that when there is no schema to be applied to the document object, the value of “schema conformity presence / absence” is “0”.
[0115]
FIG. 16 and FIG. 17 show examples in which a concept hierarchy is expressed as a structured document. The concept hierarchy is the result of hierarchically classifying, according to their semantic content, the words used as needed as keywords in search conditions in the structured document management system of FIG. 1. The “concept” information shown in FIGS. 16 and 17 is a content document described in XML.
[0116]
The example of “concept” information shown in FIG. 16 represents an “information model” used as one classification axis for classifying the contents of a patent document in a so-called patent search in a concept hierarchy. “Concept” information surrounded by “concept” tags has a document structure with a nested structure. That is, in the example of FIG. 16, there are a concept “document”, a concept “relation”, and a concept “object” as child concepts of the concept “information model”. Further, as a child concept of the concept “document”, there are a concept “structured document” and a concept “unstructured document”. Furthermore, as a child concept of the concept “structured document”, there are a concept “XML” and a concept “SGML”.
[0117]
The description example of the “concept” information illustrated in FIG. 17 represents a classification axis “information operation” different from that in FIG. In the example of FIG. 17, there are a concept “search”, a concept “store”, a concept “processing”, and a concept “distribution” as child concepts of the concept “information operation”.
[0118]
The “concept” information as shown in FIGS. 16 and 17 can also be stored in the structured document database in the same manner as the “patent” information described above. That is, for example, first, “appendXML (“ uix: // root ”,“ <concept DB /> ”)” is executed on the structured document database in the state shown in FIG. Thus, a “# 201” node and a “concept DB” arc are created. In this state, when “concept” information shown in FIG. 16 is stored, “appendXML (“ uix: // root / concept DB ”,“ <concept name>... </ Concept> ”)” is executed. . In this command, ““ <concept name>... </ Concept> ”” corresponds to the “concept” information shown in FIG.
[0119]
When the add command is processed, as shown in FIG. 19, a document object tree with the “# 202” node at the top is added below the “# 201” node.
[0120]
As described above, in the structured document management system of FIG. 1, the large number of XML documents (content documents, schema documents, query documents, and so on) with different document structures registered in the structured document database are handled, as shown in FIGS. 18 and 19, as one large XML document in the form of a tree having a “root” tag at the head. Therefore, a partial XML document can be accessed through a unified access means that does not depend on document structure, namely a path into the large XML document, which makes it possible to search and process a wide range of XML documents.
[0121]
In addition, by setting a schema in a part of the structured document database, the system may automatically check whether or not the document structure of a document to be stored matches the document structure defined by the schema.
[0122]
(Search function)
The search command in the structured document management system of FIG. 1 is as follows.
[0123]
query (ql)
“query” is a command (hereinafter referred to as a search command) that executes the query ql given in () as a parameter and acquires the resulting XML document.
[0124]
For example, as shown in FIG. 20, the query is an XML document having a document structure in which a search position, a search condition, an information extraction part, and the like are described in a language similar to SQL (Structured Query Language). The query document is also a management target of the structured document management system.
[0125]
The element starting from the “kf: from” tag describes the search position and the association of variables with the values of document elements; the element starting from the “kf: where” tag describes the conditions on those variables; and the element starting from the “kf: select” tag describes the output format of the search result.
[0126]
Searches include simple search and concept search. A simple search retrieves and extracts information that satisfies the search conditions specified in the query. A concept search uses the concept information specified in the query to retrieve and extract information that satisfies the search conditions specified in the query.
[0127]
FIG. 21 shows an example of a simple search query. For the structured document database shown in FIG., the query shown in FIG. 21 is a description example of the search request “list the “titles” of the documents (“patent” information) whose “year” is “1999” and that have a “summary” element with content similar to “PC””, directed at the “patent” information document group stored below the node indicated by the “patent DB” arc.
[0128]
By the description of the element starting from the “kf: from” tag, the values of the document elements “title”, “year”, and “summary” of the “patent” information are assigned to the variables “$ t”, “$ y”, and “$ s”, respectively.
[0129]
The variable “$ y” = “1999” is compared based on the description of the element starting from the “kf: where” tag. The component “MyLike” is a function for detecting the variable “$ s” having a value similar to “PC” with the variables “$ s” and “PC” as arguments.
[0130]
The variable “$ t” is used as the output value by the description of the element starting from the “kf: select” tag.
[0131]
The “kf: star” tag is an ambiguous expression of structure. For example, “<patent><kf:star><year>” denotes an element whose tag name is “year” that exists as a descendant element of an element whose tag name is “patent”.
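The descendant semantics of “kf: star” can be contrasted with a direct-child lookup in a short Python sketch. It uses the standard-library `xml.etree.ElementTree`; the function name `star_match` and the sample document with ASCII tag names are assumptions, and the sketch does not implement the query language itself.

```python
import xml.etree.ElementTree as ET


def star_match(ancestor, tag):
    """Elements named `tag` at any depth below `ancestor`.

    This plays the role of <ancestor><kf:star><tag> in the query notation.
    """
    return [e for e in ancestor.iter(tag) if e is not ancestor]


patent = ET.fromstring(
    "<patent><title>XML database</title>"
    "<date><year>1999</year></date></patent>"
)

years = [e.text for e in star_match(patent, "year")]   # any depth
direct = [e.text for e in patent.findall("year")]      # direct children only
```

Here `years` finds the nested “year” element while `direct` finds nothing, which is why an ambiguous structure expression is useful when the exact nesting is not known in advance.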
[0132]
FIG. 22 shows a search result using the simple search query of FIG. This search result is also an XML document.
[0133]
FIG. 23 shows an example of a query for concept search. For example, the query in FIG. 23 is a description example of a search request that searches the “patent” information document group stored below the node indicated by the “patent DB” arc, using the “concept” information stored below the node indicated by the “concept DB” arc, in the structured document database in the state illustrated in FIGS. 18 and 19. Here, the child elements of the element whose value is the concept “peripheral device” have values such as the concepts “SCSI”, “memory”, and “HDD”. Further, although not shown in FIG. 18, it is assumed that each piece of “patent” information includes an element starting from a “keyword” tag.
[0134]
That is, the query of FIG. 23 is a description example of the search request “list the “titles” of the “patent” information having, as the element value of the component “keyword”, any one of the concepts below the concept “peripheral device””.
[0135]
By the description of the element starting from the “kf: from” tag, the values of the elements “title” and “keyword” of the “patent” information are substituted into the variable “$ t” and the variable “$ k”, respectively. The variable “$ x” is substituted with the values of tag child elements (“SCSI”, “memory”, “HDD”, etc.) having the value of “peripheral device” as “concept” information.
[0136]
A comparison of “$ k” = “peripheral device” or “$ k” = “$ x” is made based on the description of the element starting from the “kf: where” tag.
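The comparison performed by the concept search can be sketched in Python as follows. The XML encoding of the concept hierarchy with a `name` attribute, the function names, and the sample data are assumptions for illustration; the actual “concept” information is the one shown in FIGS. 16 and 17.

```python
import xml.etree.ElementTree as ET

# Hypothetical concept hierarchy: "peripheral device" with three child concepts.
CONCEPTS = ET.fromstring(
    '<concept name="peripheral device">'
    '<concept name="SCSI"/><concept name="memory"/><concept name="HDD"/>'
    "</concept>"
)


def child_concepts(tree, name):
    """Names of the direct child concepts of the concept called `name`."""
    for c in tree.iter("concept"):
        if c.get("name") == name:
            return [child.get("name") for child in c.findall("concept")]
    return []


def concept_match(keyword, tree, name):
    """True when $k = name, or $k = $x for some child concept $x of name."""
    return keyword == name or keyword in child_concepts(tree, name)


patents = [("disk cache", "HDD"), ("markup editor", "XML")]  # (title, keyword)
titles = [t for t, k in patents if concept_match(k, CONCEPTS, "peripheral device")]
```

A patent whose keyword is “HDD” is returned even though the query names only “peripheral device”, which is the point of expanding the condition through the concept hierarchy.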
[0137]
Next, the document search processing operation of the structured document management system of FIG. 1 will be described with reference to the flowchart shown in FIG.
[0138]
On a predetermined display device of the client terminal, a screen as a user interface as shown in FIG. 25 provided by the structured document management system 100 (for example, the request control unit 1) is displayed.
[0139]
When the user selects “XML search Win” using a pointing device such as a mouse on the screen shown in FIG. 25, a screen serving as a user interface for performing a document search, shown in FIG. 26, is displayed.
[0140]
In the search screen of FIG. 26, in the area W1, element names (tag names) of the components of the current tree structure of the structured document database are displayed in a simplified manner so that the user can understand them. In FIG. 26, only the element names of the upper hierarchy are displayed, but it is possible to display up to the end element names.
[0141]
The area W11 is an area for inputting a search target range (a search range on the tree structure), a search condition, and the like. In the area W12, the search result is displayed.
[0142]
For example, consider the search request “among the documents having “patent” as the head tag under “uix: // root”, search for documents in which the element value of the component having the “title” tag includes the character string “document” and which were created in “1998” or later”. In this case, “root” is selected from the area W1 with a mouse or the like, whereby the structured document path is input as the search target range. In the area W11, “patent” is first input as the top node (“patent” may also be input by selecting it from the area W1 with a mouse or the like). Further, the search conditions “the element value of the component “title” includes the character string “document”” and “the element value of the component “year” is “1998” or more” need only be input into the data input area provided in advance.
[0143]
Thereafter, by selecting the “search” button B21, for example, a query as shown in FIG. 27 is transmitted to the structured document management system together with an add command for storing the query in the structured document database. The storage location of the query is predetermined, and the system side automatically sets the parameters of this add command. For example, when the structured document database is in the state shown in FIG. 18, the structured document path given as the parameter indicating the storage location of the query is “uix: // root / query DB”. The other parameter of the add command is the query document.
[0144]
When receiving the query (step S100), the request receiving unit 11 passes the query to the search request processing unit 3 and passes the parameters of the add command for storing the query document to the document storage unit 21. The document storage unit 21 processes the add command, and the query is stored in the document storage unit 5 (step S101).
[0145]
On the other hand, the search request processing unit 3 accesses the index storage unit 6 and the document storage unit 5 through the data access unit 4 based on the received query, acquires the set of documents that match the search request, extracts the requested information, and outputs it via the result processing unit 12.
[0146]
For example, in the case of the above query, it is efficient to first narrow down the search target by searching for items that match the condition that the element value of the component having the “title” tag includes the character string “document”. Therefore, the object IDs of the nodes (document objects) linked to the character string “document” are obtained using the data occurrence index as shown in FIG. 10. Then, for each of them, the document object tree is traced upstream one node at a time; when the tag name “title” is reached, the trace continues further upstream, and when the tag name “patent” is reached, the document object tree Ot11 below that node is extracted.
[0147]
Next, the document object trees Ot12 in which the element value of the component “year” is “1998” or more are further extracted from the extracted document object trees Ot11.
[0148]
This document object tree Ot12 is a document that matches the contents of the query. Further, according to the request contents of the query, a structured document path to the top node of each document object tree Ot12 is obtained (step S102).
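The two-stage narrowing described in steps Ot11 and Ot12 can be sketched in Python as follows. The node table, the index contents, the function names, and the simplification that “year” is a direct child of “patent” are assumptions for illustration; only the object IDs “# 2”, “# 42”, “# 52”, and “# 62” come from FIG. 8.

```python
# Toy node table: object ID -> (tag, parent ID, text).
NODES = {
    2: ("patentDB", None, None),
    42: ("patent", 2, None),
    43: ("title", 42, "structured document search"),
    45: ("year", 42, "1999"),
    52: ("patent", 2, None),
    53: ("title", 52, "image coding"),
    55: ("year", 52, "2000"),
    62: ("patent", 2, None),
    63: ("title", 62, "document database"),
    65: ("year", 62, "1997"),
}
DATA_INDEX = {"document": {43, 63}}  # data occurrence index: string -> node IDs


def ancestor_with_tag(oid, tag):
    """Walk upstream from `oid` until a node tagged `tag` is found, else None."""
    while oid is not None and NODES[oid][0] != tag:
        oid = NODES[oid][1]
    return oid


def year_of(patent_oid):
    """Element value of the "year" child of a patent node, as an integer."""
    return next(int(text) for tag, parent, text in NODES.values()
                if tag == "year" and parent == patent_oid)


def search(word, min_year):
    # Stage 1 (Ot11): from index hits, climb to "title", then on to "patent".
    ot11 = set()
    for hit in DATA_INDEX.get(word, ()):
        title = ancestor_with_tag(hit, "title")
        if title is not None:
            patent = ancestor_with_tag(title, "patent")
            if patent is not None:
                ot11.add(patent)
    # Stage 2 (Ot12): keep the patents whose year is min_year or later.
    return sorted(p for p in ot11 if year_of(p) >= min_year)
```

Starting from the index rather than scanning every document keeps the expensive year comparison limited to the candidates that already matched the string condition.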
[0149]
The search process is not limited to the above-described method, and various efficient search methods using index information are possible.
[0150]
The search request processing unit 3 integrates the results obtained in step S102 and creates an XML document as a search result (step S103).
[0151]
For example, the XML document of the search result is as follows.
<Out>
<Result>
uix: // root / patent DB / patent [0]
</Result>
<Result>
uix: // root / patent DB / patent [2]
</Result>
</Out>
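Assembling such a result document from a list of structured document paths can be sketched in Python with the standard-library `xml.etree.ElementTree`. The function name `result_document` and the compact path spelling are assumptions; the element names `Out` and `Result` follow the example above.

```python
import xml.etree.ElementTree as ET


def result_document(paths):
    """Wrap each structured document path in a <Result> element under <Out>."""
    out = ET.Element("Out")
    for path in paths:
        ET.SubElement(out, "Result").text = path
    return ET.tostring(out, encoding="unicode")


xml_result = result_document([
    "uix://root/patentDB/patent[0]",
    "uix://root/patentDB/patent[2]",
])
```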
[0152]
The search request processing unit 3 returns the XML document together with the style sheet to the requesting client terminal via the result processing unit 12 (step S104).
[0153]
In the client terminal, the XML document shown in FIG. 11 is converted into HTML data using a style sheet, and displayed in the area W12, for example, as shown in FIG. Here, for example, since the number of structured documents obtained as a search result is large, the structured document path of the searched structured document is displayed as the search result. In this case, for example, it is assumed that a desired one of the search result structured document paths displayed in the region W12 of FIG. 26 is selected by the user. For example, it is assumed that the first one of the structured document paths displayed in the area W12 in FIG. 26 is selected. In this case, an acquisition command may be transmitted as a document acquisition request from the client terminal to the structured document management system in order to acquire a structured document specified by the selected structured document path.
[0154]
The document acquisition processing operation of the structured document management system of FIG. 1 when the acquisition command is received by the request reception unit 11 of the structured document management system will be described with reference to the flowchart shown in FIG.
[0155]
For example, an acquisition command “getXML("uix://root/patent DB/patent[0]")” is transmitted to the structured document management system.
[0156]
Here, the case where the acquisition command “getXML("uix://root/patent DB/patent[0]")” is received while the structured document database is in the state shown in FIG. 8 will be described as an example.
[0157]
When receiving the acquisition command, the request reception unit 11 passes the structured document path “uix://root/patent DB/patent[0]”, which is a parameter of the acquisition command, to the document acquisition unit 22 (step S31).
[0158]
The document acquisition unit 22 passes the structured document path to the document object tree acquisition unit 43. The document object tree acquisition unit 43 specifies a physical area in the document storage unit 5 from the structured document path, and thereby takes out the node (document object Ox5) that is represented by the structured document path and exists in that area (step S32). If the structured document path is specified correctly, the object ID of the document object Ox5 can be acquired (step S33). In this case, the process proceeds to step S35.
[0159]
For example, in the case of the above acquisition command, the “#42” node is the document object Ox5, so “#42” is acquired as the object ID, and the document object tree Ot5 below this “#42” node (the “#42” node through the “#49” node) is acquired (step S35).
[0160]
In step S32, if the document object Ox5 corresponding to the designated structured document path is not found, an error occurs (step S33), and a message such as “document acquisition failed” is returned to the client terminal via the document acquisition unit 22 and the result processing unit 12 (step S34).
[0161]
The document object tree Ot5 acquired in step S35 is converted into an XML document by the document character string acquisition unit 44. For example, in the case of the above acquisition command, the acquired XML document is an XML document of “patent” information as shown in FIG.
[0162]
The document acquisition unit 22 returns the XML document as shown in FIG. 3 (for example, together with a predetermined style sheet written in XSL (Extensible Stylesheet Language) or the like) to the client terminal via the result processing unit 12 (step S37).
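The acquisition flow of steps S31 to S37 can be sketched as follows. This is only an illustration under stated assumptions: the dictionaries NODES and PATHS are hypothetical in-memory stand-ins for the document storage unit and for the path resolution, the node contents are invented sample data, and only three of the nodes below “#42” are modeled.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the document storage unit:
# object ID -> (tag name, text value, child object IDs). Sample data only.
NODES = {
    "#42": ("patent", None, ["#43", "#44"]),
    "#43": ("title", "XML database", []),
    "#44": ("year", "1999", []),
}

# Hypothetical resolution table: structured document path -> object ID.
PATHS = {"uix://root/patent DB/patent[0]": "#42"}


def build_element(object_id):
    """Rebuild the document object tree below object_id as an Element."""
    tag, text, children = NODES[object_id]
    elem = ET.Element(tag)
    elem.text = text
    for child_id in children:
        elem.append(build_element(child_id))
    return elem


def get_xml(path):
    """Return the XML string for the path, or an error message if not found."""
    object_id = PATHS.get(path)               # steps S32 and S33
    if object_id is None:
        return "document acquisition failed"  # step S34
    tree = build_element(object_id)           # step S35
    return ET.tostring(tree, encoding="unicode")


print(get_xml("uix://root/patent DB/patent[0]"))
```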
[0163]
The client terminal converts the XML document shown in FIG. 3 into HTML data using a style sheet, and displays it in the area W13 as shown in FIG. 29, for example.
[0164]
Using XSL, XML documents can be converted into various forms. It can be converted into an XML document with a different syntax, and an HTML page can be generated from the XML document.
[0165]
Similarly, a schema can be searched.
[0166]
For example, consider the search request “search for schemas having the tag names “patent” and “summary” among the documents below “uix://root” whose first tag is “schema””. In this case, as shown in FIG. 30, “root” is selected in the area W1 with a mouse or the like, so that the structured document path is input as the search target range. Then, “#schema” is input as the top node. In addition, as search conditions, “the attribute name of a component includes the character string “patent”” and “the attribute name of a component includes the character string “summary”” are entered in the data input areas provided in advance.
[0167]
Thereafter, when the “search” button B21 is selected, a query describing the search request, for example as shown in FIG. 31, is sent to the structured document management system together with an append command for storing the query in the structured document database.
[0168]
In the case of the above query, documents matching the condition “having “#schema” as the first tag” are searched for first. To do so, the element name occurrence index as shown in FIG. 9 is used to obtain the object IDs of the nodes (document objects) linked to the element “#schema”. Then, for each of them, the arcs are traced downward in the document object tree, and when elements whose attribute names are “patent” and “summary” are reached, the document object tree Ot21 having “#schema” as its first tag is extracted. This document object tree Ot21 is a document that matches the contents of the query. Further, according to the request content of the query shown in FIG. 31, the structured document path to the top node of each document object tree Ot21 is obtained.
[0169]
If there are a plurality of document object trees Ot21, the search request processing unit 3 collects the structured document paths to their respective top nodes, creates an XML document as the search result, and returns the XML document together with the style sheet to the requesting client terminal via the search result processing unit 12.
[0170]
In the client terminal, the XML document received as the search result is converted into HTML data using a style sheet, and displayed in the area W12 as shown in FIG. 26, for example.
[0171]
In the client terminal, when one schema in the search result is selected and displayed, for example, a screen for storing and deleting documents is displayed as shown in the figure, and in its area W3 a data input area is set and displayed for each element of the “patent” information.
[0172]
The user can easily create a stored document having a document structure defined by the schema by inputting data in the data input area.
[0173]
For example, when “patent DB” is selected in the area W1 with a mouse or the like as the storage destination of the “patent” information input in the area W3 in FIG. 32, the structured document path “uix://root/patent DB” is displayed in the area W2. Thereafter, when the “Register” button B1 is selected, an append command “appendXML("uix://root/patent DB", "<patent>...</patent>")” is transmitted to the structured document management system.
[0174]
As described above, in the structured document management system in FIG. 1, the many XML documents (content documents, schema documents, query documents, etc.) with different document structures registered in the structured document database are handled, as shown in FIGS. 18 and 19, as one large tree-shaped XML document headed by a “root” tag. Accordingly, a document that matches the search condition can easily be found among a large number of documents having different schemas.
[0175]
Further, since the query used for the search is also a structured document, an application that reuses a past query can be easily constructed by storing the query in the structured document database as a log.
[0176]
(Document structure search)
The configuration and processing operation for document structure (schema) search according to the embodiment of the present invention will be described below, in addition to the basic configuration and processing operation of the structured document management system described above.
[0177]
As illustrated in FIG. 1, the structured document management system 100 includes, for performing a schema search, a schema search processing unit 303 and a template processing unit 306 in the data access unit 4, and further includes a template storage unit 404 and a semantic network storage unit 405.
[0178]
The template processing unit 306 creates a template tree representing the logical structure in order to manage the hierarchical logical structure of the structured document database (hereinafter also referred to as an XML database). This template tree is updated each time a new XML document is stored.
[0179]
The template tree created here is stored in the template storage unit 404.
[0180]
The client terminal 102 transmits a schema search request statement (query) to the structured document management system 100. This schema search request statement is received by the request receiving unit 11 of the structured document management system 100, and the schema search request statement is transferred to the schema search processing unit 303 from here.
[0181]
The schema search processing unit 303 searches for schemas based on the schema search request statement while referring to the semantic network stored in the semantic network storage unit 405 and the template stored in the template storage unit 404.
[0182]
(1) Updating the template tree
First, a template tree update processing operation at the time of document storage in the template processing unit 306 will be described.
[0183]
As described above, when the document storage unit 21 executes the append command and registers the registration target XML document in the document storage unit 5, the document storage unit 21 passes to the template processing unit 306 the structured document path indicating the position on the database where the registration target XML document is stored, together with the information on the document structure of the XML document to be registered.
[0184]
FIG. 33 shows a configuration example of the template processing unit 306.
[0185]
The template processing unit 306 first extracts the structure information of the XML document such as the element name and attribute name of the component from the document structure information of the XML document to be registered in the structure extraction unit 306a. Based on this and the structured document path, the template tree processing unit 306b updates the template tree representing the logical structure in the XML database stored in the template storage unit 404.
[0186]
The template tree is obtained by extracting the document structures of the XML documents stored in the XML database and arranging them as a tree according to the hierarchical structure of the XML database. Each node (each component) is assigned a template ID and stored. As described above, the XML document to be registered is stored by the document storage unit 21 with an object ID added to each of its components, and the template ID is also added and stored. The element names and attribute names of all the components constituting the document structure of the XML document to be registered are assigned template IDs and stored on the template tree.
[0187]
For example, consider a case where a template tree as shown in FIG. 34 is stored in the template storage unit 404 before the update. In this case, the state of the XML database is the same as in FIG. At this time, consider registering the XML document 601 shown in FIG. 35A in the XML database under “/document DB/patent DB”, and registering the XML document 602 shown in the figure under “DB/employee DB”. Here, the expression “/document DB/patent DB” is the same as the structured document path used to express the position of a component in the XML database, and here it is also used to express the position of a component on the template tree.
[0188]
The template tree stores at least a logical structure including a document structure of each XML document stored in the XML database and a storage position on the hierarchical structure of the XML database.
[0189]
Here, the template IDs given to the element names and attribute names of the components are the same as the codes TID1 to TID13 assigned to the element names (tag names) and attribute names of the components in the figures.
[0190]
In this case, since all elements are new elements as nodes on the template tree, the template tree is updated as shown in FIG.
[0191]
Further, when a plurality of XML documents having the same document structure, or sharing part of a document structure, are stored at storage positions represented by the same structured document path, the components common to two or more XML documents are aggregated and expressed as a single component in the template tree (logical structure). In this case, it is necessary to manage the correspondence between the template ID of the component on the template tree and the object ID of each component on the XML database. For this purpose, for example, the template ID of each component and the object IDs of the corresponding components in the XML documents may be recorded on the template tree, or a separate table recording the correspondence between template IDs and object IDs may be provided.
[0192]
In practice, the template tree is stored in the template storage unit 404 as an XML document as shown in FIG. This is referred to herein as a template tree document.
[0193]
As shown in FIG. 37, the template tree document is, for example, an XML document whose hierarchy is built from components having the element name “xsp:Node”, starting at the top. As described in the template tree document, when an XML document is registered in the XML database, the attribute value of the type attribute of “xsp:Node” is set to “document” for the root node (the first component) of the XML document to be registered, to “element” for each of its other elements, and to “attribute” for each component representing an attribute.
[0194]
When a folder is created, “folder” is set as the type attribute, and “rootElement” is set for the root of the XML database. Further, among the components of the XML documents to be registered, a component that is registered every time is highly likely to be essential in the document structure, so this is recorded with the attribute “required” (if it is essential, the attribute value is “yes”). In addition, simple schema extraction is performed; for example, an element that appears more than once in the same document is treated as having no limit on its number of occurrences (maxOccurs="*"). A candidate value type (“dataType”) may also be written by performing a type estimation step. The candidate is passed over if a value that does not match it arrives for this element.
[0195]
Further, the actual number of components (number of object IDs) aggregated into one component on the template tree may be described as a count attribute. Note that FIG. 37 shows only the attributes of count and type.
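The aggregation of common components and the count attribute can be illustrated with the following Python sketch. It is not the template processing unit 306 itself: the class names, the sequential assignment of template IDs, and the sample documents are assumptions made only for illustration.

```python
import itertools
import xml.etree.ElementTree as ET


class TemplateNode:
    def __init__(self, name, node_type, tid):
        self.name, self.type, self.tid = name, node_type, tid
        self.count = 0       # number of aggregated components
        self.children = {}   # (is_attribute, name) -> TemplateNode


class TemplateTree:
    def __init__(self):
        self._ids = itertools.count(1)
        self.root = self._new("root", "folder")

    def _new(self, name, node_type):
        return TemplateNode(name, node_type, f"TID{next(self._ids)}")

    def _child(self, parent, name, node_type):
        # Reuse an existing node so that common components are aggregated.
        key = (node_type == "attribute", name)
        if key not in parent.children:
            parent.children[key] = self._new(name, node_type)
        return parent.children[key]

    def register(self, folder_path, xml_text):
        """Merge the structure of one XML document under folder_path."""
        node = self.root
        for folder in filter(None, folder_path.split("/")):
            node = self._child(node, folder, "folder")
        self._merge(node, ET.fromstring(xml_text), "document")

    def _merge(self, parent, elem, node_type):
        node = self._child(parent, elem.tag, node_type)
        node.count += 1
        for attr_name in elem.attrib:
            self._child(node, attr_name, "attribute").count += 1
        for child in elem:
            self._merge(node, child, "element")

    def find(self, *names):
        """Walk folder/element children by name, starting at the root."""
        node = self.root
        for name in names:
            node = node.children[(False, name)]
        return node


tree = TemplateTree()
tree.register("/document DB/patent DB",
              "<patent><title>A</title><author>X</author></patent>")
tree.register("/document DB/patent DB",
              "<patent><title>B</title><year>1999</year></patent>")
patent = tree.find("document DB", "patent DB", "patent")
print(patent.tid, patent.type, patent.count)
```

In this sketch, two documents with partly shared structure produce a single “patent” node whose count reflects both registrations, while “year” is counted only once.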
[0196]
When an add command is executed in the document storage unit 21 to register an XML document to be registered, an index corresponding to the XML document is created as described above.
[0197]
(2) Outline of schema search processing
The client terminal 102 transmits a schema search request statement to the structured document management system 100. This schema search request statement is received by the request receiving unit 11 of the structured document management system 100, and the schema search request statement (schema query) is passed from here to the schema search processing unit 303.
[0198]
FIG. 38 shows a configuration example of the schema search processing unit 303.
[0199]
The schema search processing unit 303 includes a schema query analysis unit 321, a schema query condition processing unit 322, and a schema query output processing unit 333.
[0200]
The schema query analysis unit 321 analyzes a search request statement in which a search condition for searching a schema is described, that is, a schema query. Examples of the query description language include XQuery. FIG. 39 shows a specific example of the schema query. In addition, since the schema search needs to create data by estimating the schema and perform the search, the schema query cannot be expressed by ordinary XQuery or the like.
[0201]
The schema query starts with a component having the “xsp:query” tag and mainly consists of a portion described by the components of the “xsp:return” tag (hereinafter also simply referred to as the return clause) and a portion described by the components of the “xsp:where” tag (hereinafter also simply referred to as the where clause).
[0202]
Since the “xsp: query” tag at the head is present, the request reception unit 11 can determine that the query is a schema query.
[0203]
In the component having the “xsp:return” tag, a character string (keyword) corresponding to the element name or attribute name of the first component of the document structure to be extracted is described. All elements in the “xsp:return” clause are “xsp:elem”. A plurality of candidates may be described here. For example, element names and attribute names that are assumed to be closely related are described with the name attribute. In the schema query shown in FIG. 39, “patent” and “document” are described.
[0204]
In this case, ambiguous fluctuations other than perfect matching are also considered. This is because a schema which is not described as a “patent” schema but is defined by “Patent”, for example, is similarly searched.
[0205]
As a means for absorbing this fluctuation, the semantic network storage unit 405 stores a semantic network as shown in FIG. As shown in FIG. 41, the semantic network is a graph representing the relationship between vocabularies, and weights (similarities) between vocabularies are added on the arc. For example, the similarity of “structured document” and “XML” is “0.8”. The numerical value representing the similarity takes a value from “0” to “1”.
[0206]
FIG. 42 shows an algorithm for calculating the similarity.
[0207]
Here, a case where a vocabulary similar to “XML” is obtained while calculating the similarity will be described with reference to the flowchart shown in FIG.
[0208]
Here, a threshold value of similarity that can be determined to be similar to the vocabulary character string “XML” is, for example, “0.6”. First, the position on the semantic network of “XML” is searched (step S201). The search position weight (similarity) is set to “1.0” (step S202). Thereafter, first, the current keyword is added to the search candidate list and the fixed candidate list (step S203).
[0209]
The vocabularies “structured document” and “markup language”, which are similar to “XML”, are extracted from the semantic network shown in FIG. 41, each with a similarity of “0.8”. Here, “0.8” is obtained by multiplying the weight “1.0” of the expansion source vocabulary “XML” by the arc weight of each extracted vocabulary. The expansion source vocabulary “XML” is deleted from the search candidate list, and these two vocabularies are added (step S205).
[0210]
Since the similarity of each of the two vocabularies is “0.8”, which exceeds the threshold, they are also added to the fixed candidate list (steps S206 to S207).
[0211]
Next, when “structured document” on the search candidate list is used as the expansion source and the semantic network shown in FIG. 41 is expanded by one stage, a vocabulary whose arc weight from the expansion source is “0.5” is reached; multiplying this by the expansion source weight “0.8” gives “0.4”. Since the threshold is currently set to “0.6”, no further expansion is performed from that vocabulary. By repeating the above operations, “XML” (1.0), “markup language” (0.8), “SGML” (0.64), “HTML” (0.64), and “semi-structured document” (0.64) eventually remain on the fixed candidate list.
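The expansion procedure can be sketched in Python as follows. The network below is a hypothetical stand-in for FIG. 41: its arcs and weights (including the “database” vocabulary) are assumptions chosen so that the products match the values quoted above, and arcs are stored in one direction only for brevity. The sketch keeps every vocabulary whose accumulated similarity exceeds the threshold.

```python
from collections import deque

# Hypothetical semantic network: vocabulary -> {neighbour: arc weight}.
NETWORK = {
    "XML": {"structured document": 0.8, "markup language": 0.8},
    "structured document": {"semi-structured document": 0.8, "database": 0.5},
    "markup language": {"SGML": 0.8, "HTML": 0.8},
}


def similar_words(network, keyword, threshold=0.6):
    """Expand the network from keyword, multiplying arc weights along the way.

    Returns a dict mapping each retained vocabulary to its similarity.
    Expansion stops at vocabularies whose similarity is at or below threshold.
    """
    fixed = {keyword: 1.0}           # fixed candidate list
    candidates = deque([keyword])    # search candidate list
    while candidates:
        source = candidates.popleft()
        for word, weight in network.get(source, {}).items():
            score = fixed[source] * weight
            if score > threshold and score > fixed.get(word, 0.0):
                fixed[word] = score
                candidates.append(word)
    return fixed


result = similar_words(NETWORK, "XML", threshold=0.6)
for word, score in result.items():
    print(f"{word}: {score:.2f}")
```

Raising the threshold (for example through the sim attribute of a query) shrinks the set of retained vocabularies, because products such as 0.8 × 0.8 no longer exceed it.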
[0212]
In the schema query shown in FIG. 39, the threshold value is described as an attribute of “xsp: elem” in order to represent the threshold value of the similarity. That is, here, the attribute name “sim” is used to describe the attribute value of the sim attribute. If a threshold value is set by this attribute value, development on the semantic network for obtaining a vocabulary having a similarity degree equal to or less than the threshold value defined here can be suppressed. The threshold may be set when the user inputs a search condition, for example. However, the threshold is not limited to this, and a default value written in advance on the schema query may be used. Further, even if not described in the schema query, a value predetermined by the schema search processing unit 303 may be used.
[0213]
When searching for a schema, the user may not even know which part is to be extracted. In this case, if “<xsp:return type="all"/>” is described in the schema query, the schema search processing unit 303 automatically determines the root node of each document (as described above, the attribute value “document” is recorded on the template tree for the first component of each XML document, so the root node can be determined by checking this), takes out the document structure starting from that root node (that is, creates a pseudo schema), and returns it.
[0214]
Returning to the description of the schema query in FIG. 39, a component having an “xsp: where” tag describes a character string (keyword) included in the element name or attribute name of the component as a search condition. In this section, the search conditions are described while leaving the ambiguous state.
[0215]
For example, “<xsp:elem type="element"/>” is described when it is not known whether the character string is an element name or an attribute name, and “<xsp:elem type="word"/>” is described when it is not known whether it is an element value or an attribute value. In some cases, it may not be known at all whether it is an element name, an attribute name, an element value, or an attribute value. In this case, “<xsp:elem type="all"/>” is described.
[0216]
That is, when a certain character string is specified by the user, “<xsp:elem type="element"/>” is described if it is used as a search condition for finding a schema having the character string as an element name (or attribute name), “<xsp:elem type="word"/>” is described if it is used as a search condition for finding a schema having the character string as an element value (or attribute value), and “<xsp:elem type="all"/>” is described if it is used as a search condition for finding a schema having the character string in either an element name (or attribute name) or an element value (or attribute value).
[0217]
The name representing the schema is described by the name attribute. Here, in the schema query shown in FIG. 40, it is possible to perform a search specifying “int” and an integer type in the dataType attribute or the like. Similarly, the closeness of expansion is described by sim.
[0218]
Next, a schematic processing flow of schema search in the schema search processing unit 303 will be described with reference to a flowchart shown in FIG.
[0219]
For a schema query, the schema search processing unit 303 first performs the where clause process (step S301), then performs the return clause process (step S302), and then merges the processing results (step S303). It then performs a schema estimation process on the result, that is, extracts schemas (step S304), sorts the extracted schemas based on a scoring process (step S305), and finally formats the output (step S306) and returns the result.
[0220]
Next, the schema search will be described in more detail with a specific example.
[0221]
(3) Specific examples
The client terminal 102 displays a search condition input screen for schema search as shown in FIG. On this input screen, it is assumed that the user has input, for example, a natural sentence having the content “a schema in which an author is described in an XML-related patent” as a schema to be searched.
[0222]
In the client terminal 102, the input content is converted into a schema query as shown in FIG. 45 by natural sentence analysis, for example, and transmitted to the structured document management system 100.
[0223]
In this example, the user inputs a natural sentence for the schema search. However, the present invention is not limited to this. The input need not be a natural sentence as long as its contents include (condition 1) what character string is included as an element name (or attribute name) or as an element value (or attribute value), and (condition 2) what kind of document structure is to be extracted. In addition, separate input areas for inputting conditions 1 and 2 may be provided. Alternatively, only condition 1 may be input. When condition 1 is input, it is sufficient that at least one character string is input, and a plurality of character strings may be input separated by commas or the like.
[0224]
The character string specified as condition 1 may be an element name, an attribute name, an element value, or an attribute value. These may be input from separate input areas, or only one input area may be provided, with the schema query created on the predetermined assumption that a character string input there is one of them. Likewise, the character string input as condition 2 may be an element name or an attribute name. These may be input from separate input areas, or only one input area may be provided, with the schema query created on the predetermined assumption that a character string input there is one of them.
[0225]
Moreover, only condition 1 may be sufficient, and only condition 2 may be sufficient.
[0226]
The client terminal 102 may hold a plurality of types of schema query templates that are completed by substituting the character strings corresponding to conditions 1 and 2 (for example, a schema query for when both conditions 1 and 2 are specified, a schema query for when only one of them is specified, a schema query for when multiple character strings are specified in condition 2, and so on), select one of them according to the input (specified) conditions, and create a schema query by substituting the character strings corresponding to condition 1 or condition 2 into its blank parts.
[0227]
Further, the client terminal 102 may transmit the character string or natural sentence input by the user corresponding to the above condition 1 or 2 as a search request to the structured document management system as it is. In this case, the schema query may be generated in the request receiving unit 11 of the structured document management system that has received such a search request. The same applies to the case where the request reception unit 11 creates a schema query.
[0228]
In the case of the input content shown in FIG. 44, the character strings corresponding to condition 1 are “XML” and “author”; the former is, for example, a character string included in an element value, and the latter is, for example, a character string included in an element name or attribute name. Further, the character string corresponding to condition 2 is “patent”, which is determined, for example, to be a character string included in an element name. As a result, a schema query as shown in FIG. 45 is created.
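How a schema query might be assembled from conditions 1 and 2 can be sketched as follows. This is not the query of FIG. 45, whose exact form is not reproduced here: the function name build_schema_query is hypothetical, and carrying the character string in a name attribute inside the where clause is an assumption. When condition 2 is absent, the sketch falls back to the “<xsp:return type="all"/>” form described earlier.

```python
from xml.sax.saxutils import quoteattr


def build_schema_query(condition1=(), condition2=()):
    """Assemble a schema query string.

    condition1: iterable of (type, character string) pairs, where type is
                "element", "word" or "all".
    condition2: iterable of names for the document structure to extract.
    """
    lines = ["<xsp:query>"]
    if condition2:
        lines.append("  <xsp:return>")
        lines += [f"    <xsp:elem name={quoteattr(n)}/>" for n in condition2]
        lines.append("  </xsp:return>")
    else:
        lines.append('  <xsp:return type="all"/>')
    if condition1:
        lines.append("  <xsp:where>")
        lines += [
            f"    <xsp:elem type={quoteattr(t)} name={quoteattr(w)}/>"
            for t, w in condition1
        ]
        lines.append("  </xsp:where>")
    lines.append("</xsp:query>")
    return "\n".join(lines)


query = build_schema_query(
    condition1=[("element", "author"), ("word", "XML")],
    condition2=["patent"],
)
print(query)
```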
[0229]
The schema query is passed to the schema search processing unit 303 via the request reception unit 11 of the structured document management system 100. Alternatively, it is created by the request receiving unit 11 and passed to the schema search processing unit 303.
[0230]
First, the schema query analysis unit 321 analyzes the where section of the schema query (step S301 in FIG. 43). This processing operation will be described with reference to the flowchart shown in FIG.
[0231]
In the where clause, one component with the element name (tag name) “xsp:elem” is described for each character string corresponding to condition 1. To represent a character string (here, for example, “author”) included in the element name of a component (or in either the element name or the attribute name), “type="element"” is specified for it. To represent a character string (here, for example, “XML”) included in the element value of a component (or in either the element value or the attribute value), “type="word"” is specified for it. To represent a character string included in either the element name (or the element name or attribute name) or the element value (or the element value or attribute value) of a component, “type="all"” is specified for it.
[0232]
First, components in the where clause are taken out one by one (step S401), and it is checked whether each element is one of the above (steps S402 to S404).
[0233]
In the case of the schema query shown in FIG. 45, the first element specifies “type="element"” for the character string “author” (step S402), so the process proceeds to step S405. Here, components whose element names match or include the character string “author”, or the character string of a vocabulary similar to it, are searched for (structure estimation process).
[0234]
In the second element, “type="word"” is specified for the character string “XML” (step S403), so the process proceeds to step S406. Here, components whose element values match or include “XML”, or the character string of a vocabulary similar to it, are searched for (vocabulary estimation process).
[0235]
If the where clause contains a component whose character string is specified with “type="all"” (step S404), components having that character string, or a character string similar to it, as an element name or element value are searched for in the same manner as in steps S405 and S406 (steps S407 and S408).
[0236]
The processes in steps S401 to S408 are repeated for all the components in the where clause.
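The dispatch in steps S401 to S408 can be sketched in Python as follows. The namespace URI bound to the “xsp” prefix and the use of a name attribute to carry the character string are assumptions made so that the sample query can be parsed; the function only decides which estimation process applies to each component and does not perform the search itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the description does not give one.
NS = {"xsp": "urn:example:xsp"}

QUERY = """<xsp:query xmlns:xsp="urn:example:xsp">
  <xsp:return><xsp:elem name="patent"/></xsp:return>
  <xsp:where>
    <xsp:elem type="element" name="author"/>
    <xsp:elem type="word" name="XML"/>
  </xsp:where>
</xsp:query>"""


def plan_where_clause(query_text):
    """Return a list of (process, character string) pairs for the where clause."""
    root = ET.fromstring(query_text)
    plan = []
    for elem in root.findall("xsp:where/xsp:elem", NS):   # step S401
        kind, word = elem.get("type"), elem.get("name")
        if kind == "element":                              # S402 -> S405
            plan.append(("structure", word))
        elif kind == "word":                               # S403 -> S406
            plan.append(("vocabulary", word))
        elif kind == "all":                                # S404 -> S407, S408
            plan.append(("structure", word))
            plan.append(("vocabulary", word))
    return plan


plan = plan_where_clause(QUERY)
print(plan)
```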
[0237]
Here, taking the case where the character string to be processed is “author” as an example, the structure estimation processing in steps S405 and S407 will be described with reference to the flowchart shown in FIG.
[0238]
First, synonyms of “author” are extracted using the semantic network (step S411). Here, since no similarity threshold is specified (the sim attribute is not described), the network is expanded until the similarity becomes equal to or less than a predetermined threshold (here, “0.6”). In the case of the semantic network shown in FIG. 41, “author” (similarity “1.0”), “author” (similarity “0.8” × “0.8” = “0.64”), and “name” (similarity “0.7”) are obtained.
[0239]
Next, in the same manner as in the document search described above, the element name occurrence index stored in the index storage unit 6 is searched to acquire the object IDs of the components whose element names contain the character string to be processed or any of the extracted similar words. Further, the template ID corresponding to each acquired object ID is acquired based on, for example, the table or the description on the template tree mentioned above (step S412).
[0240]
Then, an evaluation value (first evaluation value) is obtained for each template ID (step S413). The first evaluation value may be anything as long as it evaluates how much the component corresponding to the acquired template ID is similar to the character string (vocabulary) specified as the search condition.
[0241]
For example, when the current XML database is in the state shown in FIG. 46, the template IDs acquired in step S412 are “TID7”, “TID15”, and “TID57”. If the similarity of the vocabulary included in the element name is used as the evaluation value as it is, the evaluation value of “TID7” is “1.0”, that of “TID15” is “0.7”, and that of “TID57” is “0.64”. Each template ID and its evaluation value are recorded in the list L1. At this time, for each acquired template ID, the object IDs corresponding to the template ID and their number are also recorded in the list L1.
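Steps S411 to S413 can be illustrated with the following sketch. All data are hypothetical: the object IDs, the contents of the element name occurrence index, and the vocabulary “creator” (a placeholder for the synonym with similarity 0.64) are assumptions. For brevity the index is looked up by exact element name, whereas the description also allows names that merely include the character string.

```python
# Similar words and their similarities (step S411); "creator" is a placeholder.
SIMILAR = {"author": 1.0, "name": 0.7, "creator": 0.64}

# Hypothetical element name occurrence index: element name -> object IDs.
ELEMENT_NAME_INDEX = {
    "author": ["#12", "#31"],
    "name": ["#55"],
    "creator": ["#77"],
}

# Hypothetical correspondence table: object ID -> template ID.
OBJECT_TO_TEMPLATE = {"#12": "TID7", "#31": "TID7", "#55": "TID15", "#77": "TID57"}


def first_evaluation(similar, name_index, object_to_template):
    """Build list L1: template ID -> evaluation value and matching object IDs."""
    l1 = {}
    for word, similarity in similar.items():
        for object_id in name_index.get(word, []):         # step S412
            tid = object_to_template[object_id]
            entry = l1.setdefault(tid, {"score": 0.0, "object_ids": []})
            # Use the vocabulary similarity as the evaluation value (step S413).
            entry["score"] = max(entry["score"], similarity)
            entry["object_ids"].append(object_id)
    return l1


L1 = first_evaluation(SIMILAR, ELEMENT_NAME_INDEX, OBJECT_TO_TEMPLATE)
for tid, entry in L1.items():
    print(tid, entry["score"], len(entry["object_ids"]))
```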
[0242]
Next, taking the case where the character string to be processed is “XML” as an example, the vocabulary estimation processing in steps S406 and S408 will be described with reference to the flowchart shown in FIG.
[0243]
First, similar words of “XML” are extracted using the semantic network (step S421). In this case, “XML” (similarity “1.0”), “SGML” (similarity “0.72”), and “structured document” (similarity “0.81”) are obtained.
[0244]
Next, in the same manner as in the above-described document search, the data occurrence index stored in the index storage unit 6 is searched, and the object IDs of the components whose element values contain the character string to be processed or any of the extracted similar words are acquired. Further, the template ID corresponding to each acquired object ID is acquired, for example based on the description in the table or template tree as described above (step S422).
[0245]
Then, an evaluation value (second evaluation value) is obtained for each template ID (step S423). The second evaluation value may be anything as long as it evaluates how similar the component corresponding to the acquired template ID is to the character string (vocabulary) specified as the search condition, and how many of the specified character string and its similar words the component contains.
[0246]
The component corresponding to each acquired template ID may have some or all of the plurality of vocabularies as element values. In addition, a plurality of object IDs of components corresponding to the acquired template ID may be acquired from the data occurrence index.
[0247]
Here, for example, for each acquired template ID, the evaluation value is given based on which of the plurality of vocabularies are included in the element values of the component corresponding to that template ID. Alternatively, the evaluation value may be obtained by multiplying the number of object IDs (obtained from the data occurrence index) corresponding to each template ID by the vocabulary similarity.
[0248]
For example, when the current XML database is in the state shown in FIG. 46, the template ID “TID6” of a component having the element name “title” is obtained in step S422, and the template IDs “TID17” and “TID58” are obtained as well.
[0249]
The component “title” of the template ID “TID6” is found by both the vocabulary “XML” (similarity “1.0”) and the vocabulary “SGML” (similarity “0.72”). Therefore, for example, the similarities of these vocabularies are added to give 1 + 0.72 = 1.72 as the evaluation value. Alternatively, if the number of object IDs found with the vocabulary “XML” is 100 and the number found with the vocabulary “SGML” is 50, 1 × 100 + 0.72 × 50 = 136 may be used as the evaluation value. In the following description, the former calculation is assumed for simplicity.
[0250]
Since the component of the template ID “TID17” is found only by the vocabulary “XML”, the similarity “1.0” of this vocabulary is given as its evaluation value.
[0251]
Since the component of the template ID “TID58” is found only by the vocabulary “XML”, the similarity “1.0” of this vocabulary is given as its evaluation value.
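The two ways of computing the second evaluation value described above can be checked with a short sketch. The function name `second_evaluation` and the data layout are assumptions made for illustration. The similarities and the counts 100 and 50 are the example figures from the text; the count of 1 for “TID17” is a placeholder, since the text gives no count for it.

```python
def second_evaluation(hits, use_counts=False):
    """hits maps each matching vocabulary to (similarity, number of object IDs found)."""
    if use_counts:
        # Variant: weight each similarity by the number of object IDs found.
        return sum(sim * count for sim, count in hits.values())
    # Variant used in the rest of the description: plain sum of similarities.
    return sum(sim for sim, _count in hits.values())


tid6_hits = {"XML": (1.0, 100), "SGML": (0.72, 50)}
tid17_hits = {"XML": (1.0, 1)}  # count is a placeholder, not from the text

print(round(second_evaluation(tid6_hits), 2))                    # sum of similarities
print(round(second_evaluation(tid6_hits, use_counts=True), 2))   # count-weighted sum
print(round(second_evaluation(tid17_hits), 2))
```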
[0252]
Each template ID and its evaluation value (second evaluation value) are recorded in the list L2. At this time, for each of the acquired template IDs, the object ID corresponding to the template ID and the number thereof are also recorded in the list L2.
[0253]
The following template IDs are obtained from the processing of the where clause. In the parentheses, a second evaluation value is indicated by a number prefixed with “D”, and a first evaluation value by a number prefixed with “S”.
[0254]
{TID6 (D1.72), TID7 (S1.0), TID15 (S1.0), TID17 (D1.0), TID57 (S0.64), TID58 (D1.0)}
Next, the schema query analysis unit 321 analyzes the return clause of the schema query (step S302 in FIG. 43). This processing operation will be described with reference to the flowchart shown in FIG.
[0255]
In the return clause, the element names and attribute names of the first components of the document structures to be extracted are described one by one, each in a component with the element name (tag name) “xsp:elem”. In the schema query shown in FIG. 45, “patent” is designated as the character string included in the element name.
[0256]
First, elements (components having the element name “xsp:elem”) are extracted one by one from the return clause (step S431), and structure estimation processing as shown in FIG. 55 is performed with the character string described in the element, for example “patent” in this case, as the character string to be processed (step S432). Steps S431 to S432 are performed for all elements described in the return clause (step S433).
[0257]
In the structure estimation processing, similar words are first extracted using the semantic network in the same manner as described above.
[0258]
In this case, “patent” (similarity “1.0”) and “patent” (similarity “0.8”) are extracted.
[0259]
For example, when the current XML database is in the state shown in FIG. 46, acquiring from the element name occurrence index the components that have either of the two vocabularies as an element name yields the following three template IDs.
[0260]
{TID4 (S1.0), TID55 (S0.8), TID72 (S1.0)}
In FIG. 46, the components (nodes) corresponding to these three template IDs are circled.
[0261]
Next, the process of step S303 in FIG. 43 is performed. That is, from the components obtained by processing the where clause and those obtained by processing the return clause, the component that is the head of the document structure of one XML document is extracted from the template tree. Specifically, the parent-child relationship on the template tree between the components extracted by the processing of the return clause and the components extracted by the processing of the where clause is checked.
[0262]
The component extracted by the process of the where clause is a component downstream of the document structure, and the component extracted by the process of the return clause is the first component of the document structure.
[0263]
For example, if a component extracted by the processing of the where clause exists below the node corresponding to a component extracted by the processing of the return clause, the latter component is registered in the output candidate list as a candidate for the first component of the document structure to be extracted.
[0264]
Hereinafter, a case where the XML database is in a state as shown in FIG. 46 will be described as an example. In this case, the structure of the template tree is the same as that in FIG. However, as shown in FIG. 46, vocabularies such as “XML” and “SGML” are not included in the template tree. Also, in FIG. 46, the attributes of the constituent elements are shown surrounded by double rectangles.
[0265]
First, the template tree is searched upstream from the components on the downstream side to check the parent-child relationship.
[0266]
For example, the parent element of the component of the template ID “TID6” (the component one level above that contains the component of the template ID “TID6”) is the component of the template ID “TID4”. Since the component of the template ID “TID4” was extracted by the processing of the return clause, the template ID “TID4” is added to the output candidate list at this point.
[0267]
Subsequently, the parent element of the component of the template ID “TID7” is likewise the component of the template ID “TID4”. Since the template ID “TID4” has already been registered in the output candidate list, the process proceeds to the next component.
[0268]
The parent element of the template ID “TID15” is a constituent element of the template ID “TID14”, but this is not a constituent element extracted by the process of the return clause, and therefore goes back further upstream. There is “TID13” in the hierarchy one level above “TID14”, “TID12” in the hierarchy one level above, and “TID1” in the hierarchy one level above. Since none of these is a component extracted by the processing of the return clause, the template ID “TID15” is excluded from the schema search.
[0269]
Similarly, the template ID “TID17” is also excluded.
[0270]
Each of the template IDs “TID57” and “TID58” has the template ID “TID55” on its upstream side, and this is a component extracted by the processing of the return clause, so the template ID “TID55” is registered in the output candidate list.
[0271]
At this time, “TID4” and “TID55” are registered in the output candidate list.
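The downstream-to-upstream check described above can be sketched as follows. The parent map encodes only the relationships stated in the text; the parents given for “TID17” and “TID57” are assumptions (the text says only that they are excluded or lie below “TID55”), and the parents of “TID4” and “TID55” are simply omitted. The function name `find_head_candidates` is hypothetical.

```python
# Child -> parent on the template tree (partial; see assumptions in comments).
PARENT = {
    "TID6": "TID4",
    "TID7": "TID4",
    "TID15": "TID14",
    "TID14": "TID13",
    "TID13": "TID12",
    "TID12": "TID1",
    "TID17": "TID14",  # assumption: the text only says TID17 is excluded like TID15
    "TID57": "TID55",  # assumption: the text only says TID55 is upstream of TID57
    "TID58": "TID55",  # assumption: same as above
}

WHERE_HITS = ["TID6", "TID7", "TID15", "TID17", "TID57", "TID58"]
RETURN_HITS = {"TID4", "TID55", "TID72"}


def find_head_candidates(where_hits, return_hits, parent):
    """Trace upstream from each where-clause hit until a return-clause hit is found."""
    candidates = []
    for tid in where_hits:
        node = parent.get(tid)
        while node is not None:
            if node in return_hits:
                if node not in candidates:  # skip heads that are already registered
                    candidates.append(node)
                break
            node = parent.get(node)  # go one level further upstream
        # If the loop ends without a break, the hit is excluded from the schema search.
    return candidates


output_candidates = find_head_candidates(WHERE_HITS, RETURN_HITS, PARENT)
print(output_candidates)
```

With this data the list contains “TID4” and “TID55”, matching the state described in the text; “TID15” and “TID17” reach “TID1” without meeting a return-clause component and are dropped.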
[0272]
Next, the template tree is searched from upstream to downstream to check the parent-child relationship.
[0273]
The template IDs extracted by the processing of the return clause were “TID4”, “TID55”, and “TID72”. Of these, “TID4” and “TID55” are already registered in the output candidate list. Therefore, for example, the remaining “TID72” is additionally registered in the output candidate list. Of course, “TID72” need not be registered in the output candidate list in this case.
[0274]
At this time, “TID4”, “TID55”, and “TID72” are registered in the output candidate list.
[0275]
Next, a component group to be extracted is extracted from the document structure starting with the previously extracted component by the schema estimation process in step S304 of FIG.
[0276]
As shown in FIG. 46, the constituent elements constituting the document structure starting with “TID4” are {TID4 (S1.0), TID5, TID6 (D1.72), TID7 (S1.0), TID8, TID9, TID10, TID11} (see FIG. 47).
[0277]
First, based on the information described in the template tree as shown in FIG. 37 (for example, the value of the attribute “required”), the components that are indispensable for constructing the document structure starting with “TID4” are selected from among these components.
[0278]
Among the above components, those designated on the template tree as essential for constructing the document structure starting with “TID4” (that is, components whose attribute “required” has the value “yes”) are “TID5”, “TID6”, “TID7”, and “TID8”, so a document structure (schema) consisting of these components is obtained as one of the search results.
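The selection of essential components can be illustrated with a small filter. This is a sketch only: the dictionary `REQUIRED` stands in for the “required” attribute on the template tree, the value “no” for the non-essential components is an assumption (the text states only which components are “yes”), and keeping the head component in the result is also an assumption.

```python
# Components of the document structure that starts with TID4 (from the text).
STRUCTURE_TID4 = ["TID4", "TID5", "TID6", "TID7", "TID8", "TID9", "TID10", "TID11"]

# Stand-in for the "required" attribute on the template tree.
# "yes" values follow the text; "no" is an assumed value for the others.
REQUIRED = {
    "TID5": "yes", "TID6": "yes", "TID7": "yes", "TID8": "yes",
    "TID9": "no", "TID10": "no", "TID11": "no",
}


def select_required(head, components, required):
    """Keep the head component plus every component marked required="yes"."""
    selected = [head]
    for tid in components:
        if tid != head and required.get(tid) == "yes":
            selected.append(tid)
    return selected


schema_tid4 = select_required("TID4", STRUCTURE_TID4, REQUIRED)
print(schema_tid4)
```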
[0279]
FIG. 48 shows a schema document (XML document) in which this document structure is described in XML Schema. <xsd:sequence> is added to this document.
[0280]
Similarly, the components constituting the document structure starting with “TID55” are {TID55, TID56, TID57, TID58} (see FIG. 49). All of these components are designated on the template tree as essential for constructing the document structure starting with “TID55”, so the document structure (schema) consisting of this component group is obtained as another of the search results.
[0281]
FIG. 50 shows a schema document (XML document) in which this document structure is described in XML Schema.
[0282]
Similarly, the components constituting the document structure starting with “TID72” are {TID72, TID73, TID74, TID75} (see FIG. 51). All of these components are designated on the template tree as essential for constructing the document structure starting with “TID72”, so the document structure (schema) consisting of this component group is obtained as another of the search results. In the template tree as shown in FIG. 37, information defining a numeric type, that is, “dataType”, is described for the component “TID75”, so a numeric type is set for it in the schema.
[0283]
In this manner, when the top component of a document structure is obtained in step S303 of FIG. 43, the group of components constituting the document structure starting with that component is extracted from the template tree in step S304. At this time, as described above, the components to be extracted may be selected based on the description of each component in the template tree, or all the components constituting the document structure may be extracted without selection.
[0284]
Further, the extracted document structure is converted into an XML document, that is, a schema document as shown in FIGS. When converting to this schema document, the attribute (for example, numeric type) of each component is described based on the information described for the component on the template tree.
[0285]
As described above, the template tree describes all the information used when searching the schema and creating the schema document as an XML document as shown in FIG.
[0286]
Next, the process proceeds to step S305 in FIG. 43, and an evaluation score (score) is calculated for each schema obtained as a search result.
[0287]
For example, the component group constituting the schema that starts with the component of the template ID “TID4” obtained as a search result is, when shown together with the first evaluation values (S) and the second evaluation values (D), {TID4 (S1.0), TID5, TID6 (D1.72), TID7 (S1.0), TID8, TID9, TID10, TID11}. The evaluation score of this schema is the sum of the first and second evaluation values given to its components: S1.0 + S1.0 + D1.72 = 3.72.
[0288]
Here, when calculating the sum, weights may be set for the first evaluation value and the second evaluation value, respectively. In the above example, the weight is “1” for both the first evaluation value and the second evaluation value.
[0289]
Likewise, suppose that the component group constituting the schema that starts with the component of the template ID “TID55” obtained as a search result is, when shown together with the first evaluation values (S) and the second evaluation values (D), for example {TID55 (S0.8), TID56, TID57 (S1.0), TID58}. Calculated in the same manner as above, the evaluation score of this schema is S0.8 + S1.0 = 1.8.
[0290]
Furthermore, the component group constituting the schema that starts with the component of the template ID “TID72” obtained as a search result is, when shown together with the first evaluation values (S) and the second evaluation values (D), {TID72 (S1.0), TID73, TID74, TID75}. Calculated in the same manner as above, the evaluation score of this schema is “1.0”.
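The score calculation in step S305 amounts to a weighted sum, which the following sketch reproduces for the three example schemas. The function name `schema_score`, the weight parameter names, and the tuple layout are assumptions for illustration; the evaluation values are those quoted in the text, and only components that carry an evaluation value are listed.

```python
def schema_score(evaluated_components, w_first=1.0, w_second=1.0):
    """Sum first (S) and second (D) evaluation values, each with its own weight."""
    total = 0.0
    for _tid, kind, value in evaluated_components:
        if kind == "S":
            total += w_first * value
        elif kind == "D":
            total += w_second * value
    return total


# (template ID, kind of evaluation value, value) as quoted in the description.
SCHEMAS = {
    "TID4": [("TID4", "S", 1.0), ("TID6", "D", 1.72), ("TID7", "S", 1.0)],
    "TID55": [("TID55", "S", 0.8), ("TID57", "S", 1.0)],
    "TID72": [("TID72", "S", 1.0)],
}

scores = {head: schema_score(parts) for head, parts in SCHEMAS.items()}
for head, score in scores.items():
    print(head, round(score, 2))
```

Passing different values for `w_first` and `w_second` corresponds to the weighting mentioned in the text; both default to 1 as in the example.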
[0291]
Next, the process proceeds to step S306 in FIG. 43 to generate output data.
[0292]
Here, for example, the evaluation score obtained in step S305 is written into the schema document of each schema extracted as a search result in step S304, and the schema documents are collected into one XML document to create output data as shown, for example, in FIG. The XML document shown in FIG. 52 arranges and displays the schema documents in descending order of evaluation score.
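A minimal sketch of how such output data might be assembled is shown below, using Python's standard `xml.etree.ElementTree`. The element names `searchResult` and `schema` and the attributes `head` and `score` are invented for illustration and are not the format of FIG. 52; the empty `schema` elements stand in for the actual schema documents.

```python
import xml.etree.ElementTree as ET


def build_output(scored_docs):
    """Write each score into its schema document and collect them, best score first."""
    root = ET.Element("searchResult")  # hypothetical wrapper element
    ranked = sorted(scored_docs, key=lambda item: item[0], reverse=True)
    for score, doc in ranked:
        doc.set("score", f"{score:.2f}")  # evaluation score recorded in the document
        root.append(doc)
    return root


# Placeholder schema documents, identified only by their head component.
docs = [
    (1.0, ET.Element("schema", {"head": "TID72"})),
    (3.72, ET.Element("schema", {"head": "TID4"})),
    (1.8, ET.Element("schema", {"head": "TID55"})),
]

output = build_output(docs)
print(ET.tostring(output, encoding="unicode"))
```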
[0293]
The output data (search result document) as shown in FIG. 52 created by the schema search processing unit 303 is transmitted from the result processing unit 12 to the client terminal 102 (together with a style sheet for displaying the search result document, as necessary).
[0294]
When the client terminal 102 receives the search result document shown in FIG. 52, it displays the document using the style sheet sent together with it (or a predetermined one). A schema selected by the user is likewise displayed using the style sheet sent together with the search result document (or a predetermined one). FIG. 53 shows a display example of a schema obtained as a search result, namely the first schema in the search result document shown in FIG. 52 (the document structure starting with the component of the template ID “TID4”).
[0295]
In the display example shown in FIG. 53, input areas are provided so that the display can be used directly as an input form.
[0296]
Note that the storage position (structured document path), on the hierarchical structure of the XML database, of XML documents having the schema obtained as a search result is the same as the position of the schema on the template tree. When a new XML document created by inputting data into the input form shown is stored in the XML database, this structured document path may be used as the path specified for an insert command or an add command. The user can therefore store the newly created XML document at an appropriate position in the XML database without having to decide where to store it.
[0297]
In this way, when a new XML document is created using the search result of the document structure, XML documents that are similar in content (belonging to the same category) have almost the same document structure, which prevents schema flooding. In addition, since the user need not specify the storage location on the XML database, XML documents that are similar in content (belonging to the same category) can be easily managed on the XML database (for example, in FIG. 46, all documents that start with a component having the element name “patent” are stored under the “patent DB” node).
[0298]
Next, suppose that, on the search condition input screen for schema search, the user enters a search condition of the above-mentioned type (condition 1), for example “XML-related schema”, that is, a condition that specifies only a character string included in an element name, element value, or attribute value of the schema to be searched. In this case, there is only a vague request at the keyword level.
[0299]
When such a search condition is input, assume that the client terminal 102 or the request reception unit 11 of the structured document management system creates, for example, a search request statement (schema query) for searching for document structures that include the character string “XML” in an element value.
[0300]
This schema query is processed in the same manner as described above by the schema search processing unit 303 of the structured document management system having the XML database as shown in FIG. In the processing of the where clause, the vocabulary estimation processing extracts “TID6”, “TID17”, and “TID58” as the template IDs of components whose element values contain “XML” or a character string representing a vocabulary similar to it (for example, “SGML”). Since the similarity of “XML” is “1.0” and that of “SGML” is “0.64”, these extracted components, shown together with their evaluation values, are {TID6 (D1.67), TID17 (D1.0), TID58 (D1.0)}.
[0301]
Next, the return clause is processed. In this case, there is no user-specified condition to be described in the return clause, so the return clause of the schema query contains, for example, “<xsp:return type="all"/>”. The schema search processing unit 303 therefore determines the root nodes of documents on the template tree (as described above, the attribute value “document” is described in the top component of each XML document on the template tree, so a root node can be determined by checking this attribute value), and extracts from the template tree the document structures that start from such a root node and include any of the components (“TID6”, “TID17”, “TID58”) extracted by processing the where clause.
[0302]
That is, in this case, the process proceeds to step S303 in FIG. First, the template tree is traced upstream from the component of the template ID “TID6” to reach “TID4” (see FIG. 46). The description of this component in the template tree as shown in FIG. 37 is then checked. Since “<xsp:Node type="document">” is described for this component, the document structure starting with “TID4” is set as an extraction candidate. That is, “TID4” is registered in the output candidate list.
[0303]
Similarly, “TID14” and “TID55” are also added to the output candidate list.
[0304]
In other words, at this point, the following three document structures are to be extracted:
(1) {TID4, TID5, TID6, TID7, TID8, TID9, TID10, TID11}
(2) {TID14, TID15, TID16, TID17}
(3) {TID55, TID56, TID57, TID58}
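When the return clause specifies no head component, the head is found from the “document” attribute instead. The sketch below illustrates that variant. The parent relationships for “TID17” and “TID58” and the function name `find_document_roots` are assumptions; the set of nodes typed “document” follows the three structures listed above.

```python
# Nodes on the template tree that carry type="document" (heads of the three structures).
NODE_TYPE = {"TID4": "document", "TID14": "document", "TID55": "document"}

# Child -> parent (partial). Direct parents of TID17 and TID58 are assumptions.
PARENT = {
    "TID6": "TID4",
    "TID17": "TID14",
    "TID58": "TID55",
    "TID14": "TID13",
    "TID13": "TID12",
    "TID12": "TID1",
}

WHERE_HITS = ["TID6", "TID17", "TID58"]


def find_document_roots(where_hits, parent, node_type):
    """Trace upstream from each hit to the nearest node typed "document"."""
    roots = []
    for tid in where_hits:
        node = tid
        while node is not None:
            if node_type.get(node) == "document":
                if node not in roots:
                    roots.append(node)
                break
            node = parent.get(node)
    return roots


output_candidates = find_document_roots(WHERE_HITS, PARENT, NODE_TYPE)
print(output_candidates)
```

Unlike the earlier example, “TID14” is kept here, because the stopping condition is the “document” attribute rather than membership in the return-clause results.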
[0305]
Of the components constituting the above three document structures, only the essential components are selected based on the description of the template tree, and the component group (schema) to be extracted is obtained (step S304 in FIG. 43).
[0306]
Further, an evaluation score is obtained for each schema obtained as a search result (step S305 in FIG. 43), and output data is created.
[0307]
As described above, when a schema has been set in advance for a retrieved schema (that is, when it is the schema of documents described according to a predetermined schema), the schema document describing that schema may be retrieved from the XML database and used as the schema of the search result.
[0308]
Further, schema information may be additionally sent to the client together with the data of a search result obtained by XQuery or the like. The result of a search with XQuery is only data, and its structure is partly difficult to see, but the schema information can complement it. In addition, by describing the cumulative counts for the template elements as result information, the paths in the database and their frequencies can be grasped at a glance by browsing them.
[0309]
Further, access authority may be set for each component on the template tree. For example, assume that there are a department A and a department B, and that the XML documents under the jurisdiction of each are stored in predetermined areas of the XML database. In this case, department B cannot access the documents under the jurisdiction of department A, but an access authority may be set so that, for example, only the schema is disclosed to department B. Disclosing the schema reduces cost when, for example, applications of department A and department B are linked later, because differences between schemas need not be absorbed.
[0310]
Also, when various documents are to be stored in the XML database, it is often difficult to know where a given piece of data should be stored, so it is often stored only temporarily in an arbitrary location, and the information is then often buried. Under the management of the XML database, documents are managed on the server side and represented as one XML tree, so a guideline indicating where to store a document is also provided. Furthermore, the storage location need not be a folder and may be a partial document. Schema flooding can therefore be reduced as much as possible.
[0311]
As described above, according to the above-described embodiment, a desired schema, and the position in the XML database from which the schema is extracted, can be easily searched for in an XML database having a hierarchical structure, based on an ambiguous condition in which neither structure nor vocabulary is fixed, regardless of whether an explicit schema definition exists.
[0312]
[Effect of the invention]
As described above, according to the present invention, a document structure can be efficiently searched based on a vague search condition from an XML database having a hierarchical structure.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of a structured document management system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a usage form of the structured document management system shown in FIG. 1, and shows a case where the structured document management system is operating on the back end of the WWW.
FIG. 3 is a diagram showing an example of a structured document described in XML.
FIG. 4 is a diagram schematically showing the document structure of the structured document in FIG. 3.
FIG. 5 is a diagram for explaining the function of an additional command, and shows a case where the additional command is executed in the initial state of the structured document database.
FIG. 6 is a diagram showing a processing result when an acquisition command is executed on the structured document database in the state shown in FIG.
FIG. 7 shows a case where a document object tree of one “patent” information is added to the structured document database in the state shown in FIG. 5B by executing an add command.
FIG. 8 shows a case where a document object tree of three “patent” information is added to the structured document database in the state shown in FIG. 5B by executing an add command.
FIG. 9 is a diagram showing a storage example of an element name occurrence index.
FIG. 10 is a diagram showing a storage example of a data occurrence index.
FIG. 11 is a diagram showing an execution result when an acquisition command for extracting three pieces of “patent” information is executed on the structured document database in the state shown in FIG. 8.
FIG. 12 is a diagram showing an example of a schema that defines the document structure of an XML document.
FIG. 13 is a diagram showing a case where a schema storage command is executed on the structured document database in the state shown in FIG. 8 and the schema shown in FIG. 12 is additionally stored (set).
FIG. 14 is a diagram showing a document object tree in which a schema is set and an attribute value indicating that the schema exists is set.
FIG. 15 is a diagram conceptually illustrating a state in which attribute values indicating that a schema exists are stored in each object file.
FIG. 16 is a diagram showing an example in which a conceptual hierarchy used in a search is expressed as a structured document as necessary.
FIG. 17 is a diagram showing an example in which a conceptual hierarchy used in a search is expressed as a structured document as necessary.
FIG. 18 is a diagram showing a case where a document object tree of “concept” information shown in FIGS. 16 and 17 is added to the structured document database in the state shown in FIG. 8 by executing an add command.
FIG. 19 is a diagram showing a case where a document object tree of “concept” information shown in FIGS. 16 and 17 is added to the structured document database in the state shown in FIG. 8 by executing an add command.
FIG. 20 is a diagram showing an example of a query (XML document).
FIG. 21 is a diagram showing an example of a simple search query (XML document).
FIG. 22 is a diagram showing search results (XML document) using the simple search query of FIG. 21.
FIG. 23 is a diagram showing an example of a concept search query (XML document).
FIG. 24 is a flowchart for explaining the document search processing operation of the structured document management system in FIG. 1.
FIG. 25 is a diagram showing a display example of a screen as a user interface.
FIG. 26 is a diagram showing a display example of a screen as a user interface for performing a document search.
FIG. 27 is a diagram showing a query created based on information input from the screen shown in FIG.
FIG. 28 is a flowchart for explaining the document acquisition processing operation of the structured document management system of FIG. 1.
FIG. 29 is a diagram showing a display example of a structured document obtained as a result of executing a document acquisition command.
FIG. 30 is a diagram showing a display example of a screen as a user interface for performing a document search, for explaining a schema search processing operation.
FIG. 31 is a diagram showing an example of a schema search query.
FIG. 32 shows a display example of a screen as a user interface for acquiring a schema, and shows a display example of the acquired schema.
FIG. 33 is a diagram illustrating a configuration example of a template processing unit.
FIG. 34 is a diagram showing an example of a template tree.
FIG. 35 is a diagram showing a specific example of an XML document.
FIG. 36 is a diagram showing the template tree shown in FIG. 34 after being updated.
FIG. 37 is a diagram showing a storage example of a template tree.
FIG. 38 is a diagram showing a configuration example of a schema search processing unit 303.
FIG. 39 is a diagram showing an example of a schema query.
FIG. 40 is a diagram showing another example of a schema query.
FIG. 41 is a diagram showing a specific example of a semantic network.
FIG. 42 is a flowchart for explaining processing for obtaining a similar word of a keyword specified as a search condition using a semantic network.
FIG. 43 is a flowchart for explaining the processing operation of the schema search processing unit.
FIG. 44 is a diagram showing a display example of a search condition input screen for schema search.
FIG. 45 is a diagram showing an example of a schema query created based on a search condition input from the input screen shown in FIG. 44.
FIG. 46 is a diagram showing a specific example of a template tree.
FIG. 47 is a diagram showing a document structure obtained as a result of search.
FIG. 48 is a diagram showing an example of an XML document describing the document structure shown in FIG. 47.
FIG. 49 shows another document structure obtained as a result of the search.
FIG. 50 is a diagram showing an example of an XML document describing the document structure shown in FIG. 49.
FIG. 51 is a diagram showing still another document structure obtained as a result of the search.
FIG. 52 is a diagram showing an example of output data.
FIG. 53 is a diagram showing a display example of a schema obtained as a search result.
FIG. 54 is a flowchart for explaining a processing procedure of a where clause in a schema query.
FIG. 55 is a flowchart for explaining structure estimation processing.
FIG. 56 is a flowchart for explaining vocabulary estimation processing.
FIG. 57 is a flowchart for explaining the processing procedure of a return clause in a schema query.
[Explanation of symbols]
100 ... Structured document management system
303 ... Schema search processing unit
306 ... Template processing unit
306a ... Structure extraction unit
306b ... Template tree processing unit
321 ... Schema query analysis unit
322 ... Schema query condition processing unit
323 ... Schema query output processing unit
404 ... Template storage unit
405 ... Semantic network storage unit

Claims

Storage means for storing a template tree in which, on a logical structure including the document structure of each of a plurality of structured documents stored in a database having a hierarchical structure and the storage locations of those documents on the hierarchical structure, constituent elements common to a plurality of structured documents that are stored at the same storage location on the hierarchical structure and whose document structures are the same or partly the same are aggregated into one and expressed as one constituent element;
Processing means for performing processing for retrieving a desired document structure from different document structures of a plurality of structured documents stored in the database;
A document structure search method in a document structure search apparatus comprising:
A step in which the processing means obtains a similar word of a keyword input as a search condition for searching the document structure;
A search step in which the processing means searches, from the hierarchical structure composed of the constituent elements constituting the different document structures, for a constituent element that includes the keyword or any of the similar words in its element name or element value;
An extraction step in which the processing means detects the first constituent element of the document structure including the searched constituent element by tracing the template tree upstream from the searched constituent element, and extracts, from the template tree, the document structure starting from the detected constituent element;
An output step in which the processing means outputs the extracted document structure as a search result;
A document structure search method including the above steps.

In the template tree, first attribute information indicating that a constituent element is the first constituent element is added to the first constituent element of the document structure corresponding to each of the plurality of structured documents,
The document structure search method according to claim 1, wherein the extraction step extracts the document structure as the search result from the template tree based on the first attribute information.

The document structure search method according to claim 1, wherein, in the template tree, second attribute information indicating that a constituent element is an essential constituent element is added to each constituent element that is essential to the document structure, among the constituent elements constituting the document structure corresponding to each of the plurality of structured documents, and
the extraction step extracts from the template tree a document structure made up of the essential constituent elements, based on the second attribute information.
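
As a rough illustration of extraction based on the second attribute information, the sketch below copies a structure while keeping only the elements flagged as essential. The `Element` class, its `essential` field, and `essential_only` are hypothetical names, and the choice to drop a non-essential element together with its whole subtree is an assumption that the claim does not settle.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    essential: bool = False  # stands in for the "second attribute information"
    children: list["Element"] = field(default_factory=list)


def essential_only(node: Element) -> Element:
    """Copy a structure, keeping only children flagged as essential.

    Assumption: a non-essential child is dropped together with its subtree.
    """
    kept = [essential_only(c) for c in node.children if c.essential]
    return Element(node.name, node.essential, kept)


patent = Element("patent", True, [
    Element("title", True),
    Element("abstract", False),
    Element("claims", True, [Element("claim", True), Element("note", False)]),
])

skeleton = essential_only(patent)
top_level = [c.name for c in skeleton.children]
```

With this input, `top_level` holds `title` and `claims`, and the optional `abstract` and `note` elements are absent from the copy.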

The document structure search method according to claim 1, further including a step in which the processing means obtains an evaluation value of the document structure extracted in the extraction step, based on the similarity between the keyword and the element name or element value of the found constituent element,
wherein the output step outputs the extracted document structure as the search result together with the evaluation value.
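
The claim does not fix how the similarity is computed. One possible reading, shown here purely as an assumption, scores a direct keyword match as 1.0 and a match through a similar word with a thesaurus-style weight. The `SIMILARITY` table and `evaluation_value` function are hypothetical.

```python
# Hypothetical similarity scores between a keyword and its similar words.
SIMILARITY = {("author", "inventor"): 0.8, ("author", "writer"): 0.6}


def evaluation_value(keyword: str, element_name: str, element_value: str = "") -> float:
    """Score a matched element: 1.0 for the keyword itself, less for a similar word."""
    text = f"{element_name} {element_value}"
    if keyword in text:
        return 1.0
    scores = [
        score
        for (kw, word), score in SIMILARITY.items()
        if kw == keyword and word in text
    ]
    return max(scores, default=0.0)


# (structure name, name of the element that matched) pairs, for illustration only.
hits = [("patent", "inventor"), ("book", "author"), ("blog", "writer")]
ranked = sorted(
    ((structure, evaluation_value("author", name)) for structure, name in hits),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Sorting by the evaluation value lets the output step present the structures whose matching element is closest to the keyword first.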

A document structure search method for a document structure search apparatus that comprises:
storage means for storing a template tree in which, on a logical structure that includes the document structure of each of a plurality of structured documents stored in a database having a hierarchical structure and the storage locations on the hierarchical structure, constituent elements common to a plurality of structured documents that are stored at the same storage location on the hierarchical structure and whose document structures are the same or partly the same are aggregated and expressed as one constituent element; and
processing means for performing processing for retrieving a desired document structure from among the different document structures of the plurality of structured documents stored in the database,
the method including:
a step in which the processing means obtains a similar word of each of a first keyword and a second keyword input as search conditions for searching the document structure;
a first search step in which the processing means searches the hierarchical structure, which is composed of the constituent elements constituting each of the different document structures, for a first constituent element that includes either the first keyword or its similar word in its element name or element value;
a second search step in which the processing means searches the hierarchical structure for a second constituent element that includes either the second keyword or its similar word in its element name;
an extraction step in which, when the second constituent element is reached by tracing the template tree upstream from the first constituent element, the processing means extracts from the template tree the document structure starting from the second constituent element; and
an output step in which the processing means outputs the extracted document structure as a search result.
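
The two-keyword variant can be sketched in the same spirit. The example below assumes a hypothetical `Node` class and substring matching, and it stops at the nearest ancestor whose element name matches the second keyword; the claim itself does not say which ancestor to prefer when several match.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Node:
    name: str
    value: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def matches(text, keyword, similar):
    """True if the keyword or one of its similar words occurs in the text."""
    return any(term in text for term in [keyword, *similar.get(keyword, [])])


def search_two(root, first_kw, second_kw, similar):
    results = []
    for node in walk(root):
        # First search step: element name or element value.
        if not (matches(node.name, first_kw, similar)
                or matches(node.value, first_kw, similar)):
            continue
        # Trace upstream; the second keyword is tested against element names only.
        ancestor = node.parent
        while ancestor is not None:
            if matches(ancestor.name, second_kw, similar):
                if ancestor not in results:
                    results.append(ancestor)
                break
            ancestor = ancestor.parent
    return results


root = Node("root")
patent = root.add(Node("library")).add(Node("patent"))
patent.add(Node("inventor"))
root.add(Node("book")).add(Node("author"))

found = search_two(root, "author", "patent", {"author": ["inventor"]})
```

Both `inventor` and `author` satisfy the first keyword, but only `inventor` has an ancestor named `patent`, so only that structure is extracted.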

The document structure search method according to claim 5, wherein, in the template tree, first attribute information indicating that a constituent element is the first constituent element is added to the first constituent element of the document structure corresponding to each of the plurality of structured documents, and
the extraction step extracts the document structure as the search result from the template tree based on the first attribute information.

The document structure search method according to claim 5, wherein, in the template tree, second attribute information indicating that a constituent element is an essential constituent element is added to each constituent element that is essential to the document structure, among the constituent elements constituting the document structure corresponding to each of the plurality of structured documents, and
the extraction step extracts from the template tree a document structure made up of the essential constituent elements, based on the second attribute information.

The document structure search method according to claim 5, further including a step in which the processing means obtains an evaluation value of the document structure extracted in the extraction step, based on a first similarity between the first keyword and the element name or element value of the first constituent element and a second similarity between the second keyword and the element name of the second constituent element,
wherein the output step outputs the extracted document structure as the search result together with the evaluation value.

A document structure search apparatus for retrieving a desired document structure from among the different document structures of a plurality of structured documents stored in a database having a hierarchical structure, the apparatus comprising:
storage means for storing a template tree in which, on a logical structure that includes the document structure of each of the plurality of structured documents stored in the database and the storage locations on the hierarchical structure, constituent elements common to a plurality of structured documents that are stored at the same storage location on the hierarchical structure and whose document structures are the same or partly the same are aggregated and expressed as one constituent element;
input means for inputting at least one keyword as a search condition for searching the document structure;
means for obtaining a similar word of the keyword input by the input means;
search means for searching the hierarchical structure, which is composed of the constituent elements constituting each of the different document structures, for a constituent element that includes either the keyword or the similar word in its element name or element value;
extraction means for detecting the first constituent element of the document structure that includes the constituent element found by the search means, by tracing the template tree upstream from the found constituent element, and for extracting from the template tree the document structure starting from the detected constituent element; and
means for outputting the document structure extracted by the extraction means as a search result.
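
The template tree held by the storage means aggregates constituent elements that are common to documents stored at the same location. A minimal way to picture that aggregation, assuming each document structure is reduced to a nested dictionary of element names, is shown below. The `merge` function and this dictionary representation are illustrative assumptions, not the data layout described in the specification.

```python
def merge(template: dict, document: dict) -> dict:
    """Fold one document structure into the template.

    Elements with the same name under the same parent collapse into one entry.
    """
    for name, children in document.items():
        merge(template.setdefault(name, {}), children)
    return template


# Two documents assumed to be stored at the same location in the hierarchy.
doc_a = {"patent": {"title": {}, "inventor": {}}}
doc_b = {"patent": {"title": {}, "abstract": {}}}

template = {}
for doc in (doc_a, doc_b):
    merge(template, doc)
```

After both documents are folded in, `template` contains a single `patent` element whose children are `title`, `inventor`, and `abstract`; the shared `patent` and `title` elements appear only once.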

The document structure search apparatus according to claim 9, wherein, in the template tree, first attribute information indicating that a constituent element is the first constituent element is added to the first constituent element of the document structure corresponding to each of the plurality of structured documents, and
the extraction means extracts the document structure as the search result from the template tree based on the first attribute information.

The document structure search apparatus according to claim 9, wherein, in the template tree, second attribute information indicating that a constituent element is an essential constituent element is added to each constituent element that is essential to the document structure, among the constituent elements constituting the document structure corresponding to each of the plurality of structured documents, and
the extraction means extracts from the template tree a document structure made up of the essential constituent elements, based on the second attribute information.

A document structure search apparatus for retrieving a desired document structure from among the different document structures of a plurality of structured documents stored in a database having a hierarchical structure, the apparatus comprising:
storage means for storing a template tree in which, on a logical structure that includes the document structure of each of the plurality of structured documents stored in the database and the storage locations on the hierarchical structure, constituent elements common to a plurality of structured documents that are stored at the same storage location on the hierarchical structure and whose document structures are the same or partly the same are aggregated and expressed as one constituent element;
input means for inputting at least one first keyword and at least one second keyword as search conditions for searching the document structure;
means for obtaining a similar word of the first keyword and a similar word of the second keyword input by the input means;
first search means for searching the hierarchical structure, which is composed of the constituent elements constituting each of the different document structures, for a first constituent element that includes either the first keyword or its similar word in its element name or element value;
second search means for searching the hierarchical structure for a second constituent element that includes either the second keyword or its similar word in its element name;
extraction means for extracting from the template tree, when the second constituent element is reached by tracing the template tree upstream from the first constituent element, the document structure starting from the second constituent element; and
output means for outputting the document structure extracted by the extraction means as a search result.

The document structure search apparatus according to claim 12, wherein, in the template tree, first attribute information indicating that a constituent element is the first constituent element is added to the first constituent element of the document structure corresponding to each of the plurality of structured documents, and
the extraction means extracts the document structure as the search result from the template tree based on the first attribute information.

A document structure search program for retrieving a desired document structure from among the different document structures of a plurality of structured documents stored in a database having a hierarchical structure, the program causing a computer to function as:
storage means for storing a template tree in which, on a logical structure that includes the document structure of each of the plurality of structured documents stored in the database and the storage locations on the hierarchical structure, constituent elements common to a plurality of structured documents that are stored at the same storage location on the hierarchical structure and whose document structures are the same or partly the same are aggregated and expressed as one constituent element;
input means for inputting at least one keyword as a search condition for searching the document structure;
means for obtaining a similar word of the keyword input by the input means;
search means for searching the hierarchical structure, which is composed of the constituent elements constituting each of the different document structures, for a constituent element that includes either the keyword or the similar word in its element name or element value;
extraction means for detecting the first constituent element of the document structure that includes the constituent element found by the search means, by tracing the template tree upstream from the found constituent element, and for extracting from the template tree the document structure starting from the detected constituent element; and
means for outputting the document structure extracted by the extraction means as a search result.

A document structure search program for retrieving a desired document structure from among the different document structures of a plurality of structured documents stored in a database having a hierarchical structure, the program causing a computer to function as:
storage means for storing a template tree in which, on a logical structure that includes the document structure of each of the plurality of structured documents stored in the database and the storage locations on the hierarchical structure, constituent elements common to a plurality of structured documents that are stored at the same storage location on the hierarchical structure and whose document structures are the same or partly the same are aggregated and expressed as one constituent element;
input means for inputting at least one first keyword and at least one second keyword as search conditions for searching the document structure;
means for obtaining a similar word of the first keyword and a similar word of the second keyword input by the input means;
first search means for searching the hierarchical structure, which is composed of the constituent elements constituting each of the different document structures, for a first constituent element that includes either the first keyword or its similar word in its element name or element value;
second search means for searching the hierarchical structure for a second constituent element that includes either the second keyword or its similar word in its element name;
extraction means for extracting from the template tree, when the second constituent element is reached by tracing the template tree upstream from the first constituent element, the document structure starting from the second constituent element; and
output means for outputting the document structure extracted by the extraction means as a search result.