JP2001101184A

JP2001101184A - Structured document generation method and apparatus, and storage medium storing structured document generation program

Info

Publication number: JP2001101184A
Application number: JP28193799A
Authority: JP
Inventors: Kaori Inoue; 香織井上; Seiji Yokomichi; 誠司横路; Katsumi Takahashi; 克巳高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1999-10-01
Filing date: 1999-10-01
Publication date: 2001-04-13

Abstract

(57) [Abstract] (with amendment)

[Problem] To make the attribute determination criteria used when structuring an unstructured document variable according to theme, and to enable theme-specific search according to the theme selected by the searcher at search time.

[Solution] A basic attribute set is defined for each theme. When a semi-structured document is input, the character strings of the document are subjected to pattern matching against pre-registered patterns and to dictionary matching against an attribute dictionary in which each word is described in association with a plurality of attribute names, and attribute candidates for those character strings are extracted. The attributes that may appear in the semi-structured document are obtained by referring to the per-theme attributes, priorities are assigned by referring to attribute relation rules that indicate whether attributes are in a co-occurrence relationship or an exclusive relationship with one another, and the attribute candidates with high priority are adopted as attributes and output as a structured document.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a structured document generation method and apparatus and to a storage medium storing a structured document generation program, and in particular to a structured document generation method and apparatus, and a storage medium storing a structured document generation program, for structuring unstructured documents based on a theme for the purpose of theme-specific document search.

[0002]

[Prior Art] Assigning attributes to pieces of information (particular character strings) in a plain document and clarifying the relationships among those attributes is called structuring. Attributes include semantic attributes and logical attributes; the ones dealt with here are semantic attributes. For a semantic attribute, the attribute given to a particular character string differs depending on the theme (viewpoint).

[0003] For example, the following two sentences both concern a "camera":

(1) How to take photographs with a "camera"
(2) Big price cuts on "cameras"

The word "camera" is used differently in these two sentences. In sentence (1) it is the "camera as a means" of taking photographs; in sentence (2) it is the "camera as a product" that is subject to the price cut. This context-dependent usage of a word is called an "attribute". Incidentally, the attribute need not be decided by the author; it may be decided by the reader or by some system. Conventional full-text search (e.g., "goo", http://www.goo.ne.jp) does not take such attributes into account. Because it returns every result that matches the entered keywords, the results include many useless hits. If the information is classified by attribute, however, the user can obtain only the information actually wanted. Assigning attributes to information (words and the like) in text is called "structuring".

[0004] Conventional document structuring fixes the theme (viewpoint) and thereby limits the attribute given to a particular character string to a single one. For example, when the character string "apple" appears in a document, various attributes such as "fruit", "agricultural product", or "snack" could be assigned depending on the theme (viewpoint) of the document. However, if the attribute dictionary lists only "fruit", the character string "apple" is always given the attribute "fruit".

[0005] Likewise for logical attributes, the attribute is fixed, so a specific attribute is assigned using logical relation rules and attribute value extraction rules such as dictionaries. FIG. 9 shows an example of a conventional document structuring apparatus. The document structuring apparatus shown in that figure comprises a semi-structured document input unit 11, an attribute value extraction unit 12, an extraction rule database 13, and a structured document output unit 14.

[0006] In this document structuring apparatus, when a document is input through the semi-structured document input unit 11, the attribute value extraction unit 12 refers to the extraction rule database 13 and assigns a specific attribute to a given character string. The structured document output unit 14 outputs a structured document that integrates the character strings to which attribute values have been assigned. Details are disclosed in JP-A-9-69101.

[0007]

[Problems to Be Solved by the Invention] However, the conventional document structuring apparatus described above cannot vary the attribute determination criteria according to theme, and therefore cannot flexibly perform a search by the theme that the searcher selects at search time. The present invention has been made in view of the above, and its object is to provide a structured document generation method and apparatus, and a storage medium storing a structured document generation program, that make the attribute determination criteria used when structuring an unstructured document variable according to theme and enable theme-specific search according to the theme selected by the searcher at search time.

[0008]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention (claim 1) is a structured document generation method for structuring unstructured documents based on a theme for the purpose of theme-specific document search, in which: themes, which are search viewpoints, are set in advance, and a basic attribute set is set for each theme (step 1); when a semi-structured document is input (step 2), the character strings of the document are subjected to pattern matching against pre-registered patterns and to dictionary matching against an attribute dictionary in which each word is described in association with a plurality of attribute names, whereby attribute candidates for the character strings of the semi-structured document are extracted (step 3); for the extracted attribute candidates, the attributes that may appear in the semi-structured document are obtained by referring to the per-theme attributes, and priorities are assigned according to the co-occurrence or exclusive relationship by referring to attribute relation rules that indicate whether the attributes that may appear in the semi-structured document are in a co-occurrence relationship or an exclusive relationship with one another (step 4); among the attribute candidates, those with high priority are adopted as attributes (step 5); the character strings of the input semi-structured document are tagged on the basis of the adopted attributes (step 6); and the result is output as a structured document (step 7).

[0009] In the present invention (claim 2), the priority of an attribute candidate is compared with a predetermined threshold, and the attribute candidate is deleted if the priority falls below the threshold. The present invention (claim 3) is a structured document generation apparatus for structuring unstructured documents based on a theme for the purpose of theme-specific document search, comprising: theme designation means 29 for designating in advance a theme, which is a search viewpoint; attribute set storage means 26 storing a basic attribute set for each theme; pattern storage means 23 storing patterns including symbols, character strings, and parts of speech; dictionary storage means 24 storing words in association with a plurality of attribute names; attribute relation rule storage means 27 storing attribute relation rules that indicate whether one attribute and another are in a co-occurrence relationship or an exclusive relationship; semi-structured document input means 21 for inputting a semi-structured document; attribute candidate extraction means 22 which, when a semi-structured document is input from the semi-structured document input means 21, performs pattern matching on the character strings of the document by referring to the pattern storage means 23, further performs dictionary matching by referring to the dictionary storage means 24, and extracts attribute candidates for the character strings of the semi-structured document; attribute cost calculation means 25 which, for the attribute candidates extracted by the attribute candidate extraction means 22, obtains the attributes that may appear in the semi-structured document by referring to the attribute set storage means 26, assigns priorities by referring to the attribute relation rule storage means 27 according to whether the attributes that may appear in the semi-structured document are in a co-occurrence relationship or an exclusive relationship with one another, and adopts the attribute candidates with high priority as attributes; and structured document output means 28 for tagging the character strings of the input semi-structured document on the basis of the adopted attributes and outputting the result as a structured document.

[0010] In the present invention (claim 4), the attribute cost calculation means 25 includes means for comparing the priority of an attribute candidate with a predetermined threshold and deleting the attribute candidate if the priority falls below the threshold. The present invention (claim 5) is a storage medium storing a structured document generation program for structuring unstructured documents based on a theme for the purpose of theme-specific document search, the program comprising: a semi-structured document input process for inputting a semi-structured document; an attribute candidate extraction process which, when a semi-structured document is input, performs pattern matching on the character strings of the document by referring to pre-registered patterns including symbols, character strings, and parts of speech, further performs dictionary matching by referring to a dictionary in which words are registered in advance in association with a plurality of attribute names, and extracts attribute candidates for the character strings of the semi-structured document; an attribute cost calculation process which, for the attribute candidates extracted in the attribute candidate extraction process, obtains the attributes that may appear in the semi-structured document by referring to the basic attribute sets registered in advance for each theme, assigns priorities according to the co-occurrence or exclusive relationship of the attribute candidates by referring to attribute relation rules that indicate whether the attributes that may appear in the semi-structured document are in a co-occurrence relationship or an exclusive relationship with one another, and adopts the attribute candidates with high priority as attributes; and a structured document output process for tagging the character strings of the input semi-structured document on the basis of the adopted attributes and outputting the result as a structured document.

[0011] The present invention (claim 6) includes, in the attribute cost calculation process, a process of comparing the priority of an attribute candidate with a predetermined threshold and deleting the attribute candidate if the priority falls below the threshold. As described above, in the present invention, setting a basic attribute set and basic attribute relation rules for each preset theme makes it possible to vary the attribute determination criteria.

[0012] Furthermore, by first assigning a plurality of attributes to each piece of information (character string) in the document as attribute candidates and then referring to the co-occurrence and exclusion relation rules of the basic attributes, the attribute can be determined, which enables theme-specific search according to the theme selected by the searcher at search time.

[0013]

[Embodiments of the Invention] FIG. 3 shows the configuration of the document structuring apparatus of the present invention. The document structuring apparatus shown in that figure comprises a semi-structured document input unit 21, an attribute candidate extraction unit 22, a pattern database 23, a dictionary database 24, an attribute cost calculation unit 25, an attribute set database 26, an attribute relation rule database 27, a structured document output unit 28, and a theme designation unit 29.

[0014] The semi-structured document input unit 21 receives a semi-structured document and transfers it to the attribute candidate extraction unit 22. The attribute candidate extraction unit 22 performs pattern matching and dictionary matching on character strings of the input semi-structured document by referring to the pattern database 23 and the dictionary database 24, and consists of a pattern match processing unit 221 and a dictionary match processing unit 222.

[0015] The pattern match processing unit 221 performs pattern matching by referring to the pattern database 23. The pattern database 23 describes patterns of symbols, character strings, parts of speech, and the like in association with a plurality of attribute names. The dictionary match processing unit 222 performs dictionary matching by referring to the dictionary database 24. The dictionary database 24 describes words in association with a plurality of attribute names. By matching against these databases 23 and 24, the attribute candidate extraction unit 22 extracts all attribute candidates for the document data.

[0016] The attribute cost calculation unit 25 calculates the cost of each attribute candidate. In the cost calculation, the attribute set for each theme is first obtained by referring to the attribute set database 26; next, the attribute relation rule database 27 is consulted to examine the co-occurrence and exclusive relationships among the attributes in the attribute set obtained from the attribute set database 26, and the weights are calculated. The attribute set database 26 describes theme names and the set of attributes corresponding to each. The attribute relation rule database 27 indicates, by means of weights, whether one attribute and another are in a co-occurrence relationship or an exclusive relationship.

[0017] From among the plurality of attribute candidates for a given character string, the attribute cost calculation unit 25 preferentially adopts the one with the heavier weight as the attribute. When a character string has only one attribute candidate and its weight falls below the threshold, that attribute is deleted. The structured document output unit 28 tags the character string with the attribute determined by the attribute cost calculation unit 25 and outputs the result.

[0018] The theme designation unit 29 is used by the user to input the chosen theme to the attribute cost calculation unit 25. FIG. 4 is a flowchart showing the operation of the document structuring apparatus of the present invention.
Step 101) First, as preprocessing, a plurality of search themes are set, and the databases and rules are prepared in advance. The attributes for each theme are set in the attribute set database 26. Patterns for extracting symbol and numeric attributes are created and set in the pattern database 23. Information for extracting character string attributes is set in the dictionary database 24. Attribute relation rules describing the relationships among attributes are set in the attribute relation rule database 27.

[0019] Step 102) A semi-structured document is input from the semi-structured document input unit 21 and transferred to the attribute candidate extraction unit 22.
Step 103) The pattern match processing unit 221 of the attribute candidate extraction unit 22 refers to the pattern database 23 and performs pattern matching on character strings of the input semi-structured document.

[0020] Step 104) The dictionary match processing unit 222 of the attribute candidate extraction unit 22 performs dictionary matching on the character strings by referring to the dictionary database 24 and extracts attribute candidates. The order of steps 103 and 104 may be reversed.

[0021] Step 105) The attribute candidates extracted by the attribute candidate extraction unit 22 (held in a temporary spool) are sent to the attribute cost calculation unit 25. The theme chosen by the user is input from the theme designation unit 29 to the attribute cost calculation unit 25, and the attribute cost calculation unit 25 assigns priorities to the attribute candidates by referring to the attribute set database 26 and the attribute relation rule database 27.

[0022] Step 106) The attribute cost calculation unit 25 determines whether the priority assigned to an attribute candidate is at or below a predetermined threshold. If it is at or below the threshold, processing proceeds to step 107; if it is greater than the threshold, processing proceeds to step 108.
Step 107) If the priority of the attribute candidate is at or below the predetermined threshold, the attribute candidate is deleted.

[0023] Step 108) If the priority of the attribute candidate is greater than the predetermined threshold, the attribute for each theme is determined for that attribute candidate, tags are added to the original document (the semi-structured document), and the result is output from the structured document output unit 28 as a structured document.
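The control flow of steps 102 to 108 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the function name `structure`, the callables `extract` and `prioritize`, and the returned mapping from character string to adopted attribute are all hypothetical. The strict comparison follows the flowchart wording, in which candidates at or below the threshold are deleted.

```python
from typing import Callable, Dict, List

Candidates = Dict[str, List[str]]   # character string -> candidate attribute names
Priorities = Dict[str, float]       # candidate attribute name -> priority


def structure(
    text: str,
    extract: Callable[[str], Candidates],
    prioritize: Callable[[str, List[str]], Priorities],
    threshold: float,
) -> Dict[str, str]:
    """Return the adopted attribute for each character string (hypothetical helper)."""
    adopted: Dict[str, str] = {}
    # Steps 103-104: pattern and dictionary matching produce the candidates.
    for string, attributes in extract(text).items():
        # Step 105: assign a priority to every candidate attribute.
        priorities = prioritize(string, attributes)
        # Steps 106-107: delete candidates whose priority is at or below the threshold.
        kept = {attr: p for attr, p in priorities.items() if p > threshold}
        # Step 108: adopt the remaining candidate with the highest priority.
        if kept:
            adopted[string] = max(kept, key=kept.get)
    return adopted
```

A caller would supply its own matching and cost functions for `extract` and `prioritize`; the mapping returned here would then be handed to whatever routine writes the tags into the output document.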

[0024]

[Example] An example of the present invention is described below with reference to the drawings. The description follows the processing of FIG. 4 described above. FIG. 5 shows an example of a semi-structured document that is input in one example of the present invention. First, the semi-structured document shown in FIG. 5 is input from the semi-structured document input unit 21 (step 102).

[0025] The attribute candidate extraction unit 22 performs attribute candidate extraction on the input semi-structured document. Suppose that the pattern match processing unit 221 refers to the pattern database 23 and finds that the pattern XX-XXXX-XXXX is associated with the "telephone number" and "FAX number" attributes. Because the character string "03-333X-000X" in FIG. 5 matches this pattern, the "telephone number" and "FAX number" attributes are registered as candidates for "03-333X-000X" (step 103).
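The pattern match of step 103 can be illustrated with a regular expression. The sketch below assumes that each X in the pattern XX-XXXX-XXXX stands for a single digit; the names `PATTERN_DB` and `pattern_candidates` are hypothetical, and the patent does not say how patterns are actually encoded in the pattern database 23.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical pattern database: each regex maps to every attribute it may indicate.
PATTERN_DB: List[Tuple[re.Pattern, List[str]]] = [
    # Assumption: "XX-XXXX-XXXX" means 2 digits, 4 digits, 4 digits.
    (re.compile(r"\b\d{2}-\d{4}-\d{4}\b"), ["telephone number", "FAX number"]),
]


def pattern_candidates(text: str) -> Dict[str, List[str]]:
    """Map each matched character string to all of its candidate attributes."""
    candidates: Dict[str, List[str]] = {}
    for regex, attributes in PATTERN_DB:
        for match in regex.finditer(text):
            candidates.setdefault(match.group(), []).extend(attributes)
    return candidates
```

Note that the function deliberately keeps both attribute names for a match: at this stage nothing distinguishes a telephone number from a FAX number, and the choice is deferred to the cost calculation.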

[0026] The dictionary match processing unit 222 performs dictionary matching by referring to the dictionary database 24. For example, if the dictionary database 24 states that the word "hamburger" has the semantic attributes "menu" and "product", then the two attributes "menu" and "product" are obtained for the character string "hamburger" in FIG. 5 (step 104).
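The dictionary match of step 104 amounts to a lookup that returns several attribute names per word. The following sketch uses a hypothetical in-memory dictionary (`ATTRIBUTE_DICTIONARY`) and plain substring search; a real system working on Japanese text would more likely segment the text into words first, which is outside what this passage describes.

```python
from typing import Dict, List

# Hypothetical dictionary database: word -> all semantic attributes it may carry.
ATTRIBUTE_DICTIONARY: Dict[str, List[str]] = {
    "hamburger": ["menu", "product"],
}


def dictionary_candidates(text: str) -> Dict[str, List[str]]:
    """Return every dictionary word found in the text with all its attribute names."""
    return {
        word: list(attributes)  # copy so callers cannot mutate the dictionary
        for word, attributes in ATTRIBUTE_DICTIONARY.items()
        if word in text  # simplification: substring search instead of word segmentation
    }
```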

[0027] The attribute cost calculation unit 25 performs the calculation for each attribute candidate. It refers to the attribute set database 26 to obtain the attribute set for each theme; FIG. 6 shows an example. Suppose, for instance, that the semi-structured document shown in FIG. 5 is to be structured from the viewpoint "theme: restaurant advertisement", and this theme is input through the theme designation unit 29. The corresponding attribute set, such as "shop name, menu, address, ...", can then be obtained (step 105).

[0028] At this point, any attribute candidates extracted in steps 103 and 104 that are not included in the attribute set are deleted (step 107). Next, the attribute cost calculation unit 25 refers to the attribute relation rule database 27, examines the co-occurrence and exclusive relationships among the attributes in the obtained attribute set, and calculates the weights. FIG. 7 shows an example. In the example of FIG. 7, combinations of attributes (attribute values) are listed on the left and their weighting costs on the right. A positive cost denotes a co-occurrence rule and a negative cost an exclusive rule. For the character string "hamburger" in the example of FIG. 5, of the attribute candidates "menu" and "product name", "menu" has the heavier total weight, so the "menu" attribute is assigned to "hamburger" (step 105).
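One way to read the weight calculation is as a sum of signed rule weights between a candidate and the other attributes of the selected theme. The sketch below makes that reading concrete. The text does not give the exact formula, and the contents of FIG. 6 and FIG. 7 are not reproduced here, so the attribute set, the rule weights, and the names `THEME_ATTRIBUTES`, `RELATION_RULES`, `score`, and `choose` are all illustrative assumptions. The step that deletes candidates outside the theme's attribute set is omitted for brevity.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical theme attribute set (stand-in for FIG. 6).
THEME_ATTRIBUTES: Dict[str, Set[str]] = {
    "restaurant advertisement": {"shop name", "menu", "address", "telephone number"},
}

# Hypothetical relation rules (stand-in for FIG. 7):
# positive weight = co-occurrence rule, negative weight = exclusive rule.
RELATION_RULES: Dict[FrozenSet[str], float] = {
    frozenset({"menu", "shop name"}): 2.0,
    frozenset({"menu", "address"}): 1.0,
    frozenset({"product name", "shop name"}): -1.0,
}


def score(candidate: str, theme: str) -> float:
    """Sum the rule weights between the candidate and the theme's other attributes."""
    others = THEME_ATTRIBUTES[theme] - {candidate}
    # Pairs without a rule contribute nothing (assumed default weight of 0).
    return sum(RELATION_RULES.get(frozenset({candidate, o}), 0.0) for o in others)


def choose(candidates: Iterable[str], theme: str) -> str:
    """Adopt the candidate with the heaviest total weight."""
    return max(candidates, key=lambda c: score(c, theme))
```

With these made-up weights, "menu" totals 3.0 and "product name" totals -1.0 under the restaurant advertisement theme, so `choose(["menu", "product name"], "restaurant advertisement")` selects "menu". Keying the rules by `frozenset` makes each rule symmetric, so the order in which a pair of attributes is looked up does not matter.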

[0029] The structured document output unit 28 tags the character strings with the determined attributes and outputs the result, for example as an XML document as shown in FIG. 8. Although the above example has been described on the basis of the configuration shown in FIG. 3, the invention is not limited to this example. The semi-structured document input unit, the attribute candidate extraction unit, the attribute cost calculation unit, and the structured document output unit may be built as a program and stored on a disk device connected to the computer used as the structured document generation apparatus, or on a portable storage medium such as a floppy disk or CD-ROM, and installed when the invention is to be carried out, so that the invention can be realized easily.
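The XML output step can be sketched with Python's standard `xml.etree.ElementTree` module. The layout of FIG. 8 is not reproduced in this text, so the element names `document` and `item`, the `theme` and `attribute` XML attributes, and the function name `to_xml` are assumptions chosen only to show how adopted attributes could be serialized alongside their character strings.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def to_xml(theme: str, tagged: List[Tuple[str, str]]) -> str:
    """Serialize (attribute, character string) pairs as a small XML document."""
    # Hypothetical layout: one <item> per tagged string, attribute name as XML attribute.
    root = ET.Element("document", {"theme": theme})
    for attribute, string in tagged:
        item = ET.SubElement(root, "item", {"attribute": attribute})
        item.text = string  # ElementTree escapes special characters on output
    return ET.tostring(root, encoding="unicode")
```

The attribute name is stored as an XML attribute value rather than as an element name because names such as "telephone number" contain a space and would not be valid XML element names.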

[0030] The present invention is not limited to the above example, and various modifications and applications are possible within the scope of the claims.

[0031]

[Effects of the Invention] As described above, according to the present invention, the attribute determination criteria used when structuring an unstructured document are made variable according to theme, and a theme-specific search can be performed according to the theme selected by the searcher at search time.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the principle configuration of the present invention.

FIG. 3 is a configuration diagram of the document structuring apparatus of the present invention.

FIG. 4 is a flowchart showing the operation of the document structuring apparatus of the present invention.

FIG. 5 is an example of a semi-structured document that is input in one example of the present invention.

FIG. 6 is an example of theme-specific attribute sets in one example of the present invention.

FIG. 7 is an example of attribute relation rules in one example of the present invention.

FIG. 8 is an example of a structured document that is output in one example of the present invention.

FIG. 9 is a configuration diagram of a conventional document structuring apparatus.

[Description of Reference Signs]
21: semi-structured document input means, semi-structured document input unit
22: attribute candidate extraction means, attribute candidate extraction unit
23: pattern storage means, pattern database
24: dictionary storage means, dictionary database
25: attribute cost calculation means, attribute cost calculation unit
26: attribute set storage means, attribute set database
27: attribute relation rule storage means, attribute relation rule database
28: structured document output means, structured document output unit
29: theme designation means, theme designation unit
221: pattern match processing unit
222: dictionary match processing unit

Continued from the front page

(72) Inventor: Katsumi Takahashi, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo

F-terms (reference): 5B009 QA00; 5B075 ND03 NK02 NK32 NK42 NK46 NR03 NR12 PP02 PP12 PP25 PR08 QM10 UU06

Claims

[Claims]

1. A structured document generation method for structuring unstructured documents based on a theme for the purpose of theme-specific document search, wherein: themes, which are search viewpoints, are set in advance, and a basic attribute set is set for each theme; when a semi-structured document is input, the character strings of the document are subjected to pattern matching against pre-registered patterns and to dictionary matching against an attribute dictionary in which each word is described in association with a plurality of attribute names, whereby attribute candidates for the character strings of the semi-structured document are extracted; for the extracted attribute candidates, the attributes that may appear in the semi-structured document are obtained by referring to the per-theme attributes, and priorities are assigned according to the co-occurrence or exclusive relationship by referring to attribute relation rules that indicate whether the attributes that may appear in the semi-structured document are in a co-occurrence relationship or an exclusive relationship with one another; among the attribute candidates, those with high priority are adopted as attributes; the character strings of the input semi-structured document are tagged on the basis of the adopted attributes; and the result is output as a structured document.

2. The structured document generation method according to claim 1, wherein the priority of the attribute candidate is compared with a predetermined threshold, and if the priority is lower than the threshold, the attribute candidate is deleted.

3. A structured document generation apparatus for structuring an unstructured document based on a theme for the purpose of theme-based document search, comprising: theme designating means for designating in advance a theme serving as a search viewpoint; attribute set storage means for storing a basic attribute set for each theme; pattern storage means for storing patterns including symbols, character strings, and parts of speech; dictionary storage means for storing a dictionary in which words are associated with a plurality of attribute names; attribute relation rule storage means for storing attribute relation rules indicating whether one attribute and another attribute are in a co-occurrence relation or an exclusive relation; semi-structured document input means for inputting a semi-structured document; attribute candidate extraction means for, when a semi-structured document is input from the semi-structured document input means, performing pattern matching on the character strings of the document with reference to the pattern storage means, further performing dictionary matching with reference to the dictionary storage means, and extracting attribute candidates for the character strings of the semi-structured document; attribute cost calculation means for obtaining, for the attribute candidates extracted by the attribute candidate extraction means, the attributes that may appear in the semi-structured document with reference to the attribute set storage means, assigning priorities according to the co-occurrence or exclusive relations among the attributes that may appear in the semi-structured document with reference to the attribute relation rule storage means, and adopting the attribute candidate with the higher priority as the attribute; and structured document output means for tagging the character strings of the input semi-structured document based on the adopted attribute and outputting the result as a structured document.

4. The structured document generation apparatus according to claim 3, wherein the attribute cost calculation means includes means for comparing the priority of the attribute candidate with a predetermined threshold and deleting the attribute candidate when the priority is lower than the threshold.

5. A storage medium storing a structured document generation program for structuring an unstructured document based on a theme for the purpose of theme-based document search, the program comprising: a semi-structured document input process for inputting a semi-structured document; an attribute candidate extraction process for, when the semi-structured document is input, performing pattern matching on the character strings of the document with reference to patterns registered in advance that include symbols, character strings, and parts of speech, performing dictionary matching with reference to a dictionary in which words and a plurality of attribute names are registered in advance in association with each other, and extracting attribute candidates for the character strings of the semi-structured document; an attribute cost calculation process for obtaining, for the attribute candidates extracted in the attribute candidate extraction process, the attributes that may appear in the semi-structured document with reference to a basic attribute set registered in advance for each theme, assigning priorities according to the co-occurrence or exclusive relations among the attributes that may appear in the semi-structured document with reference to attribute relation rules indicating whether attributes are in a co-occurrence relation or an exclusive relation, and adopting the attribute candidate with the higher priority as the attribute; and a structured document output process for tagging the character strings of the input semi-structured document based on the adopted attribute and outputting the result as a structured document.

6. The storage medium storing the structured document generation program according to claim 5, wherein the attribute cost calculation process includes a process of comparing the priority of the attribute candidate with a predetermined threshold and deleting the attribute candidate when the priority is lower than the threshold.
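
The flow recited in the claims (candidate extraction by pattern and dictionary matching, priority assignment from the theme's attribute set and the attribute relation rules, threshold deletion, and tagging) might be sketched as follows. This is only an illustrative sketch and not the implementation disclosed in this publication: the scoring scheme (a base priority of 1.0, plus 1.0 per co-occurrence rule and minus 1.0 per exclusive rule), the threshold value, the example patterns, dictionary entries, themes, and all function and variable names are assumptions introduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    start: int
    end: int
    attribute: str
    priority: float = 0.0


def extract_candidates(text, patterns, dictionary):
    """Pattern matching plus dictionary matching; one word may yield several attributes."""
    found = []
    for attribute, regex in patterns.items():
        for m in regex.finditer(text):
            found.append(Candidate(m.start(), m.end(), attribute))
    for word, attributes in dictionary.items():
        for m in re.finditer(re.escape(word), text):
            for attribute in attributes:
                found.append(Candidate(m.start(), m.end(), attribute))
    return found


def assign_priorities(candidates, theme_attributes, rules):
    """Assumed scoring: 1.0 base, +1.0 per co-occurring attribute, -1.0 per exclusive one."""
    present = {c.attribute for c in candidates if c.attribute in theme_attributes}
    for c in candidates:
        if c.attribute not in theme_attributes:
            c.priority = 0.0  # attribute is not expected under this theme
            continue
        c.priority = 1.0
        for other in present - {c.attribute}:
            relation = rules.get(frozenset((c.attribute, other)))
            if relation == "co-occurrence":
                c.priority += 1.0
            elif relation == "exclusive":
                c.priority -= 1.0


def structure(text, theme, patterns, dictionary, attribute_sets, rules, threshold=1.0):
    """Tag each matched string with its highest-priority attribute."""
    candidates = extract_candidates(text, patterns, dictionary)
    assign_priorities(candidates, attribute_sets[theme], rules)
    best = {}
    for c in candidates:
        if c.priority < threshold:
            continue  # delete candidates whose priority is below the threshold
        key = (c.start, c.end)
        if key not in best or c.priority > best[key].priority:
            best[key] = c
    out, pos = [], 0
    for (start, end), c in sorted(best.items()):
        if start < pos:
            continue  # skip spans that overlap an already tagged span
        out.append(text[pos:start])
        out.append(f"<{c.attribute}>{text[start:end]}</{c.attribute}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Hypothetical example data; none of these entries come from the publication.
PATTERNS = {"phone": re.compile(r"\d{2,4}-\d{3,4}-\d{4}")}
DICTIONARY = {"Sakura": ["shop_name", "flower_name"]}
ATTRIBUTE_SETS = {
    "restaurants": {"shop_name", "phone"},
    "gardening": {"flower_name", "phone"},
    "mixed": {"shop_name", "flower_name", "phone"},
}
RULES = {
    frozenset(("shop_name", "phone")): "co-occurrence",
    frozenset(("shop_name", "flower_name")): "exclusive",
}

if __name__ == "__main__":
    for theme in ATTRIBUTE_SETS:
        print(theme, structure("Sakura 03-1234-5678", theme,
                               PATTERNS, DICTIONARY, ATTRIBUTE_SETS, RULES))
```

In this sketch the same input string is tagged differently depending on the designated theme, which is the behavior the claims aim at: the word "Sakura" has two dictionary attributes, and the theme's attribute set together with the relation rules decides which one is adopted.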