JP5326781B2 - Extraction rule creation system, extraction rule creation method, and extraction rule creation program

Info

Publication number: JP5326781B2
Application number: JP2009110435A
Authority: JP
Inventors: 幸貴楠村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-04-30
Filing date: 2009-04-30
Publication date: 2013-10-30
Anticipated expiration: 2029-04-30
Also published as: JP2010262332A

Abstract

PROBLEM TO BE SOLVED: To provide an extraction rule creation system that efficiently creates rules for extracting the information a user wants.

SOLUTION: Given a tagged text and character string position information indicating the position of a character string in that tagged text, an extraction rule creation means 82 combines the word or tag at the position indicated by the character string position information with the words or tags before and after it, and creates an extraction rule, that is, a rule for extracting information from tagged text. A matching sentence position information extraction means 83 extracts, for each tagged text stored in a tagged text storage means 81, matching sentence position information indicating the position of a matching sentence, that is, a sentence containing a word or tag that matches the extraction rule. An evaluation value calculation means 84 calculates a higher evaluation value the fewer matching sentences appear within one tagged text, and a higher evaluation value the more tagged texts contain a matching sentence.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an extraction rule creation system, an extraction rule creation method, and an extraction rule creation program for creating extraction rules used to extract information from documents.

When extracting necessary information from a document, a user may want to extract information that matches some rule (pattern). Once the pattern of the desired information is known, the same pattern can be used to extract information from other documents as well.

For example, consider a user who extracts person names from a document. If the user wants to collect the names of suspects from among the person names, the pattern of the information the user wants can be inferred to be "'person name' + suspect". If instead the user wants to collect person names whose surname is "ab" (where ab stands for the two kanji of a surname), the pattern can be inferred to be "ab + 'noun'". In this way, once the pattern of the information the user extracted is known, information matching that pattern can also be extracted from other documents.

Patent Document 1 describes an information extraction rule generation device that easily generates information extraction rules from a training corpus. In that device, a tree structure display unit displays a syntax tree on a display unit. An operator enters annotations by mouse or keyboard operations while referring to the displayed syntax tree. A tree structure regular expression extraction unit then extracts, from the tree structure and the annotations, a tree structure expression representing the corresponding rule.

Non-Patent Document 1 describes a method for supporting the creation of information extraction rules based on automatic rule generation and interactive selection. In that method, a plurality of extraction rules are first created automatically from a single case, and extraction is executed with each rule. The extraction results are then presented to the user, who interactively indicates whether each result is correct, thereby narrowing the candidates down to an appropriate extraction rule. The user can thus create an appropriate extraction rule merely by judging the extraction results.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2004-318809 (paragraphs 0028 to 0032)

Non-Patent Document 1: Tsuyoshi Kawai (河合剛巨) and Shinichi Ando (安藤真一), "Proposal for Information Extraction Rule Creation Support Based on Automatic Rule Generation and Interactive Selection" (in Japanese), Proceedings of the 13th Annual Meeting of the Association for Natural Language Processing, D3-1, March 2007

When a user extracts person names from a document, the user presumably wants information that matches some pattern related to person names. However, as in the example above, several different patterns can be inferred for the information the user wants. If the user wants to collect the names of suspects, the pattern is "'person name' + suspect". If the user wants to collect person names whose surname is "ab" (where ab stands for the two kanji of a surname), the pattern is "ab + 'noun'". Because the mere fact that the user "extracts person names from a document" is consistent with multiple patterns, it is often difficult to create the pattern for extracting the desired information efficiently.

With the information extraction rule generation device described in Patent Document 1, the user must write complex extraction rules based on the syntax tree shown on the display unit. Creating extraction rules with that device therefore takes a long time. In addition, because the input method is complicated, the user must first learn how to operate the device.

With the method described in Non-Patent Document 1, the user only has to indicate whether each extraction result is correct in order to create an appropriate extraction rule. However, the user must repeat this judgment until an appropriate extraction result is obtained, so generating an extraction rule still demands considerable effort from the user.

An object of the present invention is therefore to provide an extraction rule creation system, an extraction rule creation method, and an extraction rule creation program that can efficiently create rules for extracting the information a user wants.

The extraction rule creation system according to the present invention includes: tagged text storage means for storing tagged text, which is a document containing a set of tags, each tag being information that is added at an arbitrary position in a character string and that represents position information indicating the position of the character string to which it is added and attribute information indicating the attribute of the word at that position; extraction rule creation means for, when given a tagged text and character string position information indicating the position of a character string in that tagged text, creating an extraction rule, which is a rule for extracting information from tagged text, by combining the word or tag at the position indicated by the character string position information with the words or tags before and after it; matching sentence position information extraction means for extracting, for each tagged text stored in the tagged text storage means, matching sentence position information indicating the position of a matching sentence, which is a sentence containing a word or tag that matches the extraction rule; and evaluation value calculation means for calculating, based on the matching sentence position information, an evaluation value that rates the extraction rule. The evaluation value calculation means calculates a higher evaluation value the fewer matching sentences appear within one tagged text, and a higher evaluation value the more tagged texts contain a matching sentence.

The extraction rule creation method according to the present invention includes: an extraction rule creation step of, when given a tagged text (a document containing a set of tags, each tag being information that is added at an arbitrary position in a character string and that represents position information indicating the position of the character string to which it is added and attribute information indicating the attribute of the word at that position) and character string position information indicating the position of a character string in that tagged text, creating an extraction rule, which is a rule for extracting information from tagged text, by combining the word or tag at the position indicated by the character string position information with the words or tags before and after it; a matching sentence position information extraction step of extracting, for each tagged text stored in tagged text storage means, matching sentence position information indicating the position of a matching sentence, which is a sentence containing a word or tag that matches the extraction rule; and an evaluation value calculation step of calculating, based on the matching sentence position information, an evaluation value that rates the extraction rule. In the evaluation value calculation step, a higher evaluation value is calculated the fewer matching sentences appear within one tagged text, and a higher evaluation value is calculated the more tagged texts contain a matching sentence.

The extraction rule creation program according to the present invention is applied to a computer having tagged text storage means for storing tagged text, which is a document containing a set of tags, each tag being information that is added at an arbitrary position in a character string and that represents position information indicating the position of the character string to which it is added and attribute information indicating the attribute of the word at that position. The program causes the computer to execute: extraction rule creation processing of, when given a tagged text and character string position information indicating the position of a character string in that tagged text, creating an extraction rule, which is a rule for extracting information from tagged text, by combining the word or tag at the position indicated by the character string position information with the words or tags before and after it; matching sentence position information extraction processing of extracting, for each tagged text stored in the tagged text storage means, matching sentence position information indicating the position of a matching sentence, which is a sentence containing a word or tag that matches the extraction rule; and evaluation value calculation processing of calculating, based on the matching sentence position information, an evaluation value that rates the extraction rule. In the evaluation value calculation processing, the program causes the computer to calculate a higher evaluation value the fewer matching sentences appear within one tagged text, and a higher evaluation value the more tagged texts contain a matching sentence.

According to the present invention, rules for extracting the information a user wants can be created efficiently.

An explanatory diagram showing an example of tagged text.
An explanatory diagram showing an example of the syntax used to express extraction rules.
A block diagram showing one embodiment of the extraction rule creation system according to the present invention.
An explanatory diagram showing an example of the data format in the target document storage unit 11.
An explanatory diagram showing an example of patterns created by the pattern synthesis unit 12.
An explanatory diagram showing an example of evaluation values calculated by the pattern evaluation unit 13.
A flowchart showing an example of the synthesis step.
A state transition diagram showing an example of a method for selecting patterns.
A flowchart showing an example of the algorithm in step S10.
A flowchart showing an example of the algorithm in step S40.
A flowchart showing an example of the evaluation step.
A flowchart showing an example of search processing.
A flowchart showing an example of the determination processing in step S52.
An explanatory diagram showing an example of patterns created in steps S10 to S30.
A block diagram showing the minimum configuration of the present invention.

The terms used in the description of the present invention are defined below. Tagged text is a document that contains at least a body, which is a set of character strings, and a set of tags added at arbitrary positions in those character strings. FIG. 1 is an explanatory diagram showing an example of tagged text. The tagged text illustrated in FIG. 1 contains the body "奈良県警は１４日、ａｂｃｄ容疑者（２０）を強盗殺人の容疑で逮捕したと発表。" ("Nara Prefectural Police announced on the 14th that they had arrested suspect abcd (20) on suspicion of robbery-murder."; here ab stands for the two kanji of a surname and cd for the two kanji of a given name) and a set of tags added to the character strings in the body.

A tag is information consisting of a character string that represents the attribute of a word (hereinafter, the tag name) and the position (start position and end position) in the body of the character string to which the tag is added. The start and end positions are expressed, for example, by counting the boundaries between characters, with the head of the text as 0. In the tagged text of FIG. 1, the tag added to "abcd" can be expressed as a tag with tag name "person name" (人名), start position 9, and end position 13. In the following description, a tag whose tag name is T may be written as a "T tag".
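As a minimal sketch of the definitions above, the tagged text of FIG. 1 can be modeled in Python as a body string plus a list of tags. The class names `Tag` and `TaggedText` and the method `surface` are hypothetical names chosen for illustration; only the person-name tag with positions 9 and 13 comes from the description. Counting the boundaries between characters from 0 corresponds to Python's slice indexing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tag:
    """A tag: a tag name plus start/end positions counted between characters."""
    name: str   # word attribute, e.g. "人名" (person name)
    start: int  # boundary index where the tagged string begins (head of text = 0)
    end: int    # boundary index where the tagged string ends


@dataclass
class TaggedText:
    """A body string together with the tags added to it."""
    body: str
    tags: list = field(default_factory=list)

    def surface(self, tag: Tag) -> str:
        # Boundary positions map directly onto a Python slice.
        return self.body[tag.start:tag.end]


# Body of FIG. 1; "ａｂｃｄ" are placeholder characters for a surname and a given name.
doc = TaggedText(
    body="奈良県警は１４日、ａｂｃｄ容疑者（２０）を強盗殺人の容疑で逮捕したと発表。",
    tags=[Tag(name="人名", start=9, end=13)],
)

# Prints the character string covered by the person-name tag.
print(doc.surface(doc.tags[0]))
```

Because the positions sit between characters, an empty span (start equal to end) is representable, and adjacent tags share a boundary without overlapping.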

The designated extraction position is the position, in a tagged text, of the character string that a user, an external program, or the like wants to extract, and is expressed by a start position and an end position. For example, if the user designates start position 9 and end position 13 as the designated extraction position for the tagged text of FIG. 1, it follows that the user wants to extract the character string "abcd".

A case is information representing a pair of one tagged text and a designated extraction position specified for that tagged text; it is created by a user, an external program, or the like.

An extraction rule is a rule for extracting information from tagged text. By creating this rule appropriately, the information the user wants can be extracted from tagged text. An extraction rule is expressed at least as a combination of character strings, tag names, and wildcards (sometimes called a template), and the template includes information indicating the designated extraction position.

FIG. 2 is an explanatory diagram showing an example of the syntax used to express extraction rules in the present invention. In the syntax of FIG. 2, an extraction rule R is defined as a character string containing one or more sequences, each consisting of a phrase PH and an extraction position pattern EP. A phrase PH is a character string consisting of one or more conditions KEY. A condition KEY is one of: a literal character string, a tag name enclosed in "[" and "]", a wildcard (*), or the empty string (φ). The syntax expressing an extraction rule may also be called an extraction pattern, or simply a pattern.

The extraction position pattern EP is a pattern that combines the constituent elements of the tagged text at the designated extraction position. Specifically, it is a pattern made up of character strings and tags that contains at least one tag. In the syntax of FIG. 2, EP is defined as a constituent element EPH enclosed before and after by the symbol "$". A constituent element EPH is either a character string containing one or more sequences each consisting of a tag and a condition EKEY, or an EPH itself concatenated with a condition EKEY. A condition EKEY is one of: a literal character string, a tag name enclosed in "[" and "]", or the empty string (φ).

One example of an extraction rule R is "$[person name]$suspect" (in the original Japanese, "＄［人名］＄容疑者"). This rule states that, wherever a person-name tag is immediately followed by the character string "suspect" (容疑者), the character string covered by the person-name tag is to be extracted.

Another example is the rule "Nara*$ab[noun]$" (in the original Japanese, "奈良＊＄ａｂ［名詞］＄", where ab stands for the two kanji of a surname). This rule states that, from a sentence in which the character string "Nara" (奈良) is followed by an arbitrary character string (*), then by the character string "ab", and then immediately by a noun tag, "ab" and the character string covered by the noun tag are to be extracted together.
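The two example rules can be tried out with a small matcher. The following is only a sketch under stated assumptions, not the matching procedure of the invention: the function names `tokenize` and `extract` are hypothetical, rules are written with ASCII "$", "[", "]", and "*", every literal is matched one character at a time, the empty string (φ) is not modeled, and the noun tag (名詞) on "ｃｄ" is an assumed tag added so that the second rule has something to match.

```python
def tokenize(rule):
    """Split a rule string into (kind, value) tokens."""
    tokens, i = [], 0
    while i < len(rule):
        ch = rule[i]
        if ch == "$":                      # boundary of the extraction position pattern
            tokens.append(("mark", ""))
            i += 1
        elif ch == "[":                    # tag name enclosed in brackets
            j = rule.index("]", i)
            tokens.append(("tag", rule[i + 1:j]))
            i = j + 1
        elif ch == "*":                    # wildcard: any character string
            tokens.append(("any", ""))
            i += 1
        else:                              # literal character
            tokens.append(("lit", ch))
            i += 1
    return tokens


def _match(tokens, k, pos, body, tags, marks):
    """Yield the '$' positions for every way tokens[k:] can match starting at pos."""
    if k == len(tokens):
        yield marks
        return
    kind, value = tokens[k]
    if kind == "mark":
        yield from _match(tokens, k + 1, pos, body, tags, marks + (pos,))
    elif kind == "lit":
        if body.startswith(value, pos):
            yield from _match(tokens, k + 1, pos + len(value), body, tags, marks)
    elif kind == "tag":
        # A tag condition matches any tag of that name starting exactly at pos.
        for name, start, end in tags:
            if name == value and start == pos:
                yield from _match(tokens, k + 1, end, body, tags, marks)
    else:
        # Wildcard: try every possible end position, including an empty span.
        for nxt in range(pos, len(body) + 1):
            yield from _match(tokens, k + 1, nxt, body, tags, marks)


def extract(rule, body, tags):
    """Return the distinct strings that the rule extracts from one tagged sentence."""
    tokens = tokenize(rule)
    found = []
    for pos in range(len(body) + 1):
        for marks in _match(tokens, 0, pos, body, tags, ()):
            if len(marks) == 2 and body[marks[0]:marks[1]] not in found:
                found.append(body[marks[0]:marks[1]])
    return found


BODY = "奈良県警は１４日、ａｂｃｄ容疑者（２０）を強盗殺人の容疑で逮捕したと発表。"
# (tag name, start, end); the 名詞 (noun) tag on "ｃｄ" is an assumption for illustration.
TAGS = [("人名", 9, 13), ("名詞", 11, 13)]

first = extract("$[人名]$容疑者", BODY, TAGS)
second = extract("奈良*$ａｂ[名詞]$", BODY, TAGS)
print(first, second)
```

Both rules extract the same span from this sentence, which illustrates why a single case is consistent with several candidate rules.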

In the following description, the problem of creating extraction rules from a set of tagged texts and one or more cases may be called the extraction rule creation problem.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 is a block diagram showing one embodiment of the extraction rule creation system according to the present invention. The system includes a target document storage unit 11, a pattern synthesis unit 12, a pattern evaluation unit 13, and a document search unit 14.

The target document storage unit 11 is a storage device that stores the set of tagged texts from which information is to be extracted. It may hold tagged text such as that of FIG. 1 in any format. FIG. 4 is an explanatory diagram showing an example of the format in which the target document storage unit 11 stores tagged text. In the example of FIG. 4, the target document storage unit 11 stores the tagged text of FIG. 1 split across two tables, a body table and a tag table.

The body table stores tagged text sentence by sentence. It stores a document ID, which uniquely identifies a tagged text, and a sentence ID, which uniquely identifies a sentence, in association with the character string of the body.

The tag table stores all the tags added to a tagged text. For each tag it stores, in association with one another, the tag name, the start and end positions in the body, the document ID of the document to which the tag is added, and the sentence ID of the sentence to which the tag is added.
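The two-table layout can be sketched with Python's built-in `sqlite3` module. This is an illustration only: the table names `body` and `tag`, the column names, and the ID values are assumptions, and the sketch assumes that tag positions are counted from the head of the sentence stored in the corresponding body row, which the description does not state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Body table: one row per sentence of a tagged text.
    CREATE TABLE body (doc_id INTEGER, sentence_id INTEGER, text TEXT);
    -- Tag table: one row per tag, linked to its document and sentence.
    CREATE TABLE tag (name TEXT, start_pos INTEGER, end_pos INTEGER,
                      doc_id INTEGER, sentence_id INTEGER);
    """
)

conn.execute(
    "INSERT INTO body VALUES (?, ?, ?)",
    (1, 1, "奈良県警は１４日、ａｂｃｄ容疑者（２０）を強盗殺人の容疑で逮捕したと発表。"),
)
conn.execute("INSERT INTO tag VALUES (?, ?, ?, ?, ?)", ("人名", 9, 13, 1, 1))

# Recover the tagged string by joining the two tables.
# SQLite's substr() is 1-indexed, hence start_pos + 1.
row = conn.execute(
    """
    SELECT substr(b.text, t.start_pos + 1, t.end_pos - t.start_pos)
    FROM tag AS t
    JOIN body AS b ON b.doc_id = t.doc_id AND b.sentence_id = t.sentence_id
    WHERE t.name = ?
    """,
    ("人名",),
).fetchone()
print(row[0])
```

Keeping tags in their own table lets all occurrences of a tag name be found without scanning the body text.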

Tagged text may, for example, be registered in the target document storage unit 11 in advance by an administrator, or it may be registered there by the document search unit 14 described later.

When given one or more cases (that is, pairs of a tagged text and a designated extraction position specified for that tagged text), the pattern synthesis unit 12 synthesizes (creates) candidate extraction rules, by a method described later, from the word or tag at the designated extraction position and the words or tags before and after it. FIG. 5 is an explanatory diagram showing an example of patterns created by the pattern synthesis unit 12. For example, given a case consisting of the tagged text of FIG. 1 with the 9th through 13th characters as the designated extraction position, the pattern synthesis unit 12 creates the patterns illustrated in FIG. 5 from that case and the words or tags near the designated extraction position.
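The idea of combining the element at the designated extraction position with its neighbors can be illustrated as follows. This is a deliberately simplified sketch and not the synthesis procedure of the embodiment: the helper names `alternatives` and `synthesize` are hypothetical, only one neighbor on each side is considered, the neighbors are assumed to be already segmented, and the noun tag (名詞) on "容疑者" is an assumed tag. The patterns actually shown in FIG. 5 may differ.

```python
from itertools import product


def alternatives(surface, tag_name):
    """Ways to write a neighbouring element: literal string, [tag name], or omitted."""
    alts = [surface, ""]
    if tag_name is not None:
        alts.insert(1, f"[{tag_name}]")
    return alts


def synthesize(left, target_tag, right):
    """Combine the tag at the designated position with each neighbour alternative.

    left and right are (surface, tag name or None) pairs; the extraction position
    is always written as a tag enclosed in "$", since it must contain a tag.
    """
    return [
        f"{l}$[{target_tag}]${r}"
        for l, r in product(alternatives(*left), alternatives(*right))
    ]


# Neighbours of "ａｂｃｄ" in the FIG. 1 sentence; the 名詞 tag is an assumption.
candidates = synthesize(("、", None), "人名", ("容疑者", "名詞"))
for rule in candidates:
    print(rule)
```

Even this restricted setting yields six candidates from one case, which is why the candidates need to be ranked by an evaluation value.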

The pattern evaluation unit 13 examines where the patterns created by the pattern synthesis unit 12 occur in the tagged texts stored in the target document storage unit 11 and how their occurrence frequencies are distributed, and calculates an evaluation value from this distribution. Specifically, for each tagged text stored in the target document storage unit 11, the pattern evaluation unit 13 extracts the positions of sentences containing a word or tag that matches an extraction rule (pattern) created by the pattern synthesis unit 12. Hereinafter, a sentence containing a word or tag that matches an extraction rule is called a matching sentence. The pattern evaluation unit 13 then calculates a higher evaluation value for an extraction rule the fewer matching sentences appear within one tagged text, and a higher evaluation value the more tagged texts contain a matching sentence.

For example, let pf(p) be the total number of times pattern p appears in the tagged texts in the target document storage unit 11 (that is, the number of matches), and let df(p) be the number of tagged texts in the target document storage unit 11 in which pattern p appears (that is, the number of distinct documents containing a match). The evaluation value ipfdf(p) can then be calculated by Equation 1.

The following description assumes that the pattern evaluation unit 13 calculates the evaluation value using Equation 1. The evaluation value may instead be calculated with any other formula that yields a higher value the fewer matching sentences appear within one tagged text and a higher value the more tagged texts contain a matching sentence.
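The two counts and the required behavior of the evaluation value can be sketched in Python. The scoring function below is an illustrative stand-in chosen only because it has the two stated properties; it is an assumption and is not Equation 1, which is not reproduced here. The names `pf_df` and `score` are hypothetical, and pf is approximated by counting matching sentences.

```python
from collections import Counter


def pf_df(matching_sentences):
    """Compute pf(p) and df(p) from (doc_id, sentence_id) pairs of matching sentences."""
    per_doc = Counter(doc_id for doc_id, _ in matching_sentences)
    pf = sum(per_doc.values())  # total number of matches across all tagged texts
    df = len(per_doc)           # number of distinct tagged texts containing a match
    return pf, df


def score(pf, df):
    """Illustrative score (an assumption, not Equation 1).

    df divided by the mean number of matches per document (pf / df):
    it grows with df and shrinks as matches pile up inside single documents.
    """
    if pf == 0:
        return 0.0
    return df * df / pf


# Three matches spread over three documents versus three matches in one document.
spread = [(1, 1), (2, 1), (3, 4)]
clustered = [(1, 1), (1, 2), (1, 3)]

print(score(*pf_df(spread)), score(*pf_df(clustered)))
```

With equal pf, the pattern whose matches are spread across more documents receives the higher score, as the description requires.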

FIG. 6 is an explanatory diagram showing an example of evaluation values calculated by the pattern evaluation unit 13. For example, for each pattern illustrated in FIG. 5, the pattern evaluation unit 13 examines where the pattern occurs in the tagged texts stored in the target document storage unit 11 and how its occurrence frequency is distributed, and calculates the evaluation value from that distribution.

When a user, an external program, or the like specifies an extraction condition (keywords to be extracted) as a search query, the document search unit 14 extracts the tagged texts that satisfy the condition. The document search unit 14 is realized, for example, by a document search system that extracts tagged text matching a specified condition. The tagged texts to be searched are stored in advance in a storage unit (not shown) within the document search unit 14, for example in the format illustrated in FIG. 4. The document search unit 14 may extract from that storage unit a list of the document IDs and sentence IDs that satisfy the condition, read the tagged text indicated by each document ID, and register that tagged text in the target document storage unit 11. Here, registering tagged text in the target document storage unit 11 means storing it there.

If the format used in that storage unit differs from the format illustrated in FIG. 4, the document search unit 14 may convert the tagged texts that satisfy the extraction condition into the format of FIG. 4 before registering them in the target document storage unit 11.

In other words, the document search unit 14 can be said to selectively collect a set of tagged text to be extracted, based on a search query expressed as keywords, and to insert that set into the target document storage unit 11.

In this way, the document search unit 14 extracts tagged text based on a search query specified by a user, an external program, or the like, and registers the extracted tagged text in the target document storage unit 11, which makes it possible to change the set of tagged text in the target document storage unit 11. Changing the tagged text to be extracted can raise the evaluation value of the pattern the user wants, so that extraction rules can be obtained efficiently.

The pattern synthesis unit 12, the pattern evaluation unit 13, and the document search unit 14 are realized by the CPU of a computer operating according to a program (the extraction rule creation program). For example, the program may be stored in a storage unit (not shown) of an apparatus that includes the pattern synthesis unit 12, the pattern evaluation unit 13, and the document search unit 14, and the CPU may read the program and, according to the program, operate as the pattern synthesis unit 12, the pattern evaluation unit 13, and the document search unit 14. Alternatively, the pattern synthesis unit 12, the pattern evaluation unit 13, and the document search unit 14 may each be realized by dedicated hardware.

Next, the operation will be described. The description first covers the processing in which the pattern synthesis unit 12 synthesizes (creates) patterns from an input case and the pattern evaluation unit 13 evaluates each pattern created by the pattern synthesis unit 12 (hereinafter, these processes may be collectively referred to as the case input process). It then covers the processing in which the document search unit 14 creates, from the tagged text stored in its internal storage unit (not shown), a set of tagged text whose body contains a keyword (hereinafter, this processing may be referred to as the search process).

In the following description, the processing in which the pattern synthesis unit 12 synthesizes (creates) patterns from an input case is referred to as the synthesis step, and the processing in which the pattern evaluation unit 13 evaluates each pattern created by the pattern synthesis unit 12 is referred to as the evaluation step.

First, the case input process will be described. The case input process starts when a user or an external program inputs a case to the pattern synthesis unit 12.

FIG. 7 is a flowchart showing an example of the synthesis step. When a case (that is, information including tagged text and a designated extraction position) is input, the pattern synthesis unit 12 first takes out all the character strings and tags at the designated extraction position from the tagged text, and creates patterns by extracting every combination that contains at least one tag (step S10).

FIG. 8 is a state transition diagram showing an example of how the pattern synthesis unit 12 selects patterns. The example in FIG. 8 assumes that the tagged text illustrated in FIG. 1 and the designated extraction position "from the 9th character to the 13th character" are input as a case. From the tagged text illustrated in FIG. 1, the pattern synthesis unit 12 examines every path through the state transitions illustrated in FIG. 8 and obtains five patterns: "abcd", "ab[noun]", "[noun]cd", "[noun][noun]", and "[person name]". Of these, it extracts the four patterns that contain at least one tag: "ab[noun]", "[noun]cd", "[noun][noun]", and "[person name]".

A pattern that contains no tag can collect only one specific character string (for example, the string "abcd") and is of little use as an extraction rule. Extracting only the combinations that contain at least one tag therefore eliminates useless patterns in advance and reduces the amount of subsequent computation.

The processing (algorithm) in step S10 will be described with reference to FIG. 9. FIG. 9 is a flowchart showing an example of the algorithm in step S10. The algorithm illustrated in FIG. 9 creates patterns by recursively calling a method (generate). The generate method takes three arguments: the pattern now that has been built up at the time of the call, the current position pos, and the end position end.

First, the pattern synthesis unit 12 receives, as the arguments of the generate method, now = "" (empty), pos = the start of the designated extraction position, and end = the end of the designated extraction position, and starts the processing of the generate method. In the generate method, the pattern synthesis unit 12 checks whether the position pos is greater than the end position end (step S11). If pos is greater than end (YES in step S11), the pattern synthesis unit 12 ends the processing (step S12). If pos is not greater than end (NO in step S11) and pos is equal to end (YES in step S13), the pattern synthesis unit 12 determines that the end of the state transitions has been reached and examines the pattern now at that point. If now contains at least one tag, the pattern synthesis unit 12 extracts the value of now as a pattern and ends the processing (step S14).

On the other hand, if pos is not equal to end (NO in step S13), the pattern synthesis unit 12 takes out every tag T whose start position is pos and, for each such tag T, calls the generate method with T appended to the current pattern and with the current position pos updated to the end position of T (step S15). In addition, the pattern synthesis unit 12 takes out the character c that follows the current position pos (step S16), and calls the generate method with c appended to the pattern now and with 1 added to the current position pos (step S17).

The generate method carries out the processing described above. In other words, the generate method can be said to create patterns while advancing rightward through the designated extraction position.
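The recursion of steps S11 to S17 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names Tag and span_patterns, the use of 0-based offsets between characters for positions, and the has_tag and require_tag arguments are all introduced here for illustration.

```python
from typing import List, NamedTuple


class Tag(NamedTuple):
    name: str   # e.g. "noun" or "person name"
    start: int  # offset where the tagged span begins (assumed 0-based)
    end: int    # offset just past the tagged span


def span_patterns(text: str, tags: List[Tag], start: int, end: int,
                  require_tag: bool = True) -> List[str]:
    """Enumerate character/tag combinations covering text[start:end]."""
    found: List[str] = []

    def generate(now: str, pos: int, has_tag: bool) -> None:
        if pos > end:                      # S11/S12: a tag ran past the end
            return
        if pos == end:                     # S13/S14: end position reached
            if has_tag or not require_tag:
                found.append(now)
            return
        for tag in tags:                   # S15: branch on each tag starting at pos
            if tag.start == pos:
                generate(now + f"[{tag.name}]", tag.end, True)
        # S16/S17: consume a single character and move one position right
        generate(now + text[pos], pos + 1, has_tag)

    generate("", start, False)
    return found


# Hypothetical stand-in for the FIG. 1 text: filler so "abcd" spans offsets 9 to 13.
text = "x" * 9 + "abcd"
tags = [Tag("noun", 9, 11), Tag("noun", 11, 13), Tag("person name", 9, 13)]
print(span_patterns(text, tags, 9, 13))
```

With this input the sketch yields the four tag-containing patterns of the FIG. 8 example; passing require_tag=False additionally keeps the tag-free pattern "abcd".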

Next, the pattern synthesis unit 12 creates patterns that combine the character string and tags in the R characters to the right of the designated extraction position, and patterns that combine the character string and tags in the L characters to the left of the designated extraction position (step S20 in FIG. 7). The values of R and L are, for example, arbitrary integers specified in advance by a user or developer.

The patterns that combine the character string and tags in the R characters on the right (or the L characters on the left) of the designated extraction position can be created by an algorithm similar to the one illustrated in FIG. 9, so a detailed description is omitted. Specifically, to create the patterns for the R characters on the right of the designated extraction position, the pattern synthesis unit 12 may run the generate method with the arguments now = "" (empty), pos = the end of the designated extraction position, and end = the end of the designated extraction position + R. To create the patterns for the L characters on the left of the designated extraction position, the pattern synthesis unit 12 may run the generate method with the arguments now = "" (empty), pos = the start of the designated extraction position - L, and end = the start of the designated extraction position.

Because the R characters on the right (or the L characters on the left) of the designated extraction position need not contain a tag, in this case the pattern synthesis unit 12 does not have to check in step S14 whether the pattern now contains a tag.

Next, the pattern synthesis unit 12 creates patterns based on noun tags (step S30 in FIG. 7). Specifically, the pattern synthesis unit 12 takes out all nouns that appear outside the designated extraction position, the L characters to its left, and the R characters to its right. Of these nouns, the pattern synthesis unit 12 extracts those to the left of the designated extraction position as left pattern words Plw and those to the right of the designated extraction position as right pattern words Prw.
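Step S30 might be sketched as follows. The function name pattern_words, the (name, start, end) tuple layout for tags, the sample sentence, and the decision to skip any noun that overlaps the L-character or R-character context window are assumptions made for illustration only.

```python
from typing import List, Tuple

Tag = Tuple[str, int, int]  # (tag name, start offset, end offset); assumed layout


def pattern_words(text: str, tags: List[Tag], start: int, end: int,
                  left_len: int, right_len: int,
                  word_tag: str = "noun") -> Tuple[List[str], List[str]]:
    """Collect nouns lying outside [start - left_len, end + right_len)."""
    left_limit = start - left_len    # left edge of the L-character context
    right_limit = end + right_len    # right edge of the R-character context
    left_words: List[str] = []       # candidates for Plw
    right_words: List[str] = []      # candidates for Prw
    for name, t_start, t_end in tags:
        if name != word_tag:
            continue
        if t_end <= left_limit:      # entirely to the left of the context
            left_words.append(text[t_start:t_end])
        elif t_start >= right_limit: # entirely to the right of the context
            right_words.append(text[t_start:t_end])
    return left_words, right_words


# Hypothetical sample sentence; offsets are derived with str.index.
text = "Tokyo police abcd suspect arrest"


def tag(name: str, word: str) -> Tag:
    s = text.index(word)
    return (name, s, s + len(word))


tags = [tag("noun", "Tokyo"), tag("noun", "police"), tag("person name", "abcd"),
        tag("noun", "suspect"), tag("noun", "arrest")]
start = text.index("abcd")
end = start + len("abcd")
print(pattern_words(text, tags, start, end, left_len=2, right_len=2))
```

In this sample, "police" and "suspect" overlap the two-character context windows and are therefore left out, while "Tokyo" and "arrest" become the left and right pattern words.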

Although the following description deals with the case where the pattern synthesis unit 12 creates patterns based on noun tags (that is, takes nouns out of the body text), the tags used for creating patterns are not limited to noun tags. For example, tags for other independent words such as verbs, adjectives, and adjectival verbs may be used. Creating patterns from such words makes it possible to exclude common, meaningless words such as particles from the patterns.

The pattern synthesis unit 12 creates extraction rule candidates by combining the list of patterns created in step S10 (hereinafter, list A), the list of right patterns (hereinafter, list RP) and the list of left patterns (hereinafter, list LP) created in step S20, and the list of left pattern words (hereinafter, list LW) and the list of right pattern words (hereinafter, list RW) created in step S30 (step S40).

The processing (algorithm) in step S40 will be described with reference to FIG. 10. FIG. 10 is a flowchart showing an example of the algorithm in step S40. In the algorithm illustrated in FIG. 10, the pattern synthesis unit 12 first adds the empty string "" to each of list RP, list LP, list LW, and list RW (step S41). The empty string "" means that no pattern from that list is used. The pattern synthesis unit 12 then takes one pattern from each list (that is, list A, list RP, list LP, list LW, and list RW) and performs the processing of steps S42 to S47 below for every combination of patterns taken from the lists.

The pattern synthesis unit 12 creates a pattern R by adding "$", the symbol indicating the designated extraction position, to the pattern taken from list A (step S42). Next, the pattern synthesis unit 12 appends the pattern Pr taken from list RP to the right side of the pattern R (step S43). Similarly, the pattern synthesis unit 12 adds the pattern Pl taken from list LP to the left side of the pattern R (step S44). Next, the pattern synthesis unit 12 appends to the right side of the pattern R the pattern Prw taken from list RW with the wildcard "*" attached to its left (step S45). Similarly, the pattern synthesis unit 12 adds to the left side of the pattern R the pattern Plw taken from list LW with the wildcard "*" attached to its right (step S46). Finally, the pattern synthesis unit 12 notifies the pattern evaluation unit 13 of the created pattern R (step S47). In this way, the pattern synthesis unit 12 notifies the pattern evaluation unit 13 of the patterns created from the given case.
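Steps S41 to S47 amount to a Cartesian product over the five lists. The following is a minimal sketch under stated assumptions: the function name combine is hypothetical, "$" is placed on both sides of the list A pattern to mirror the "$[person name]$suspect*arrest" example, and no wildcard is emitted when the empty string is chosen from list RW or list LW.

```python
from itertools import product
from typing import List


def combine(list_a: List[str], list_rp: List[str], list_lp: List[str],
            list_lw: List[str], list_rw: List[str]) -> List[str]:
    """Build extraction rule candidates from every combination of the lists."""
    # S41: the empty string stands for "use nothing from this list".
    rp, lp = list_rp + [""], list_lp + [""]
    lw, rw = list_lw + [""], list_rw + [""]
    candidates: List[str] = []
    for a, pr, pl, prw, plw in product(list_a, rp, lp, rw, lw):
        r = f"${a}$"              # S42: mark the designated extraction position
        r = r + pr                # S43: right context pattern
        r = pl + r                # S44: left context pattern
        if prw:                   # S45: wildcard, then right pattern word
            r = r + "*" + prw
        if plw:                   # S46: left pattern word, then wildcard
            r = plw + "*" + r
        candidates.append(r)      # S47: would be passed on for evaluation
    return candidates


rules = combine(["[person name]"], ["suspect"], [], [], ["arrest"])
print(rules)
```

Because each of the four optional lists gains one extra entry, the number of candidates grows multiplicatively with the list sizes, which is one reason the tag-free patterns are dropped earlier.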

Next, the operation of the evaluation step will be described. FIG. 11 is a flowchart showing an example of the evaluation step. On receiving each pattern from the pattern synthesis unit 12, the pattern evaluation unit 13 searches for sentences that match the pattern and extracts the document ID and sentence ID pair of each matching sentence (step S50). The pattern evaluation unit 13 then calculates the evaluation value of the received pattern from the extracted document IDs and sentence IDs (step S60). In the following description, the processing in step S50 is referred to as the search processing, and the processing in step S60 is referred to as the evaluation value calculation processing.

The search processing will be described with reference to FIG. 12. FIG. 12 is a flowchart showing an example of the search processing. The pattern evaluation unit 13 reads the records in the body text table stored in the target document storage unit 11 one by one, and reads from the tag table the tags corresponding to the document ID and sentence ID of each record (step S51). Next, the pattern evaluation unit 13 compares each sentence and its tags with the input pattern (step S52), and when the two match (that is, when a sentence or tag conforming to the pattern exists), extracts the document ID and sentence ID of that sentence (step S53).

The method of performing the search processing is not limited to the above. Any other method may be used as long as it can extract the positions of the sentences that match the patterns created by the pattern synthesis unit 12.

The processing in step S52 for determining whether the two match will be described with reference to FIG. 13. FIG. 13 is a flowchart showing an example of the determination processing in step S52. The pattern evaluation unit 13 performs the processing of steps S71 to S86 below based on the pattern P received from the pattern synthesis unit 12, the sentence S read in step S51, and the set TList of tags attached to the sentence S.

Based on a predetermined syntax (the symbols "[", "]", "*", and "$"), the pattern evaluation unit 13 splits the pattern P into individual conditions, each of which is a tag name, a character string, or a wildcard, and creates a list of the resulting conditions (hereinafter, the condition list List) (step S71). For example, the pattern "$[person name]$suspect*arrest" can be split into the four conditions "[person name]", "suspect", "*", and "arrest". The pattern evaluation unit 13 stores the conditions in the condition list List in order from the beginning of the pattern.
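One way to sketch the splitting in step S71 is with a regular expression. The function name split_pattern, the (kind, value) tuple representation, and the choice to simply discard the "$" markers are assumptions for illustration; the source does not specify how the splitting is implemented.

```python
import re
from typing import List, Tuple

# A condition is a bracketed tag name, a wildcard, or a run of plain characters.
_TOKEN = re.compile(r"\[[^\]]*\]|\*|[^\[\]*$]+")


def split_pattern(pattern: str) -> List[Tuple[str, str]]:
    """Split a pattern into (kind, value) conditions, in pattern order."""
    conditions: List[Tuple[str, str]] = []
    for token in _TOKEN.findall(pattern):   # "$" matches no alternative and is skipped
        if token.startswith("["):
            conditions.append(("tag", token[1:-1]))
        elif token == "*":
            conditions.append(("wild", token))
        else:
            conditions.append(("str", token))
    return conditions


print(split_pattern("$[person name]$suspect*arrest"))
```

For the example pattern this produces the same four conditions as in the text, in the same order.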

Next, the pattern evaluation unit 13 initializes the variable i, which represents a position in the sentence, to 0 and the variable flag, which indicates whether a wildcard is in effect, to 1 (step S72). The pattern evaluation unit 13 then takes the first condition C from the condition list List (step S73). If the condition C is a tag name (YES in step S74), the pattern evaluation unit 13 checks whether flag is 1 (step S75). If flag is 1 (YES in step S75), either a wildcard is in effect or the matching is at the beginning of the pattern. In this case, the pattern evaluation unit 13 checks whether the tag set TList contains a tag T that is specified by the condition C and whose start position is greater than i (step S76). If such a tag T exists in TList (YES in step S76), the pattern evaluation unit 13 determines that the tag T matches the condition C and assigns the end position of the tag T to i (step S78).

On the other hand, if no tag T specified by the condition C exists in the tag set TList in step S76 (NO in step S76), the pattern evaluation unit 13 outputs information indicating that the pattern does not match and ends the processing.

If flag is not 1 in step S75 (NO in step S75), no wildcard is in effect. In this case, the pattern evaluation unit 13 checks whether the tag set TList contains a tag T that is specified by the condition C and whose start position is equal to i (step S77). If such a tag T exists in TList (YES in step S77), the pattern evaluation unit 13 determines that the tag T matches the condition C and assigns the end position of the tag T to i (step S78). If no such tag T exists in TList (NO in step S77), the pattern evaluation unit 13 outputs information indicating that the pattern does not match and ends the processing.

If the condition taken out in step S74 is not a tag name (NO in step S74), the pattern evaluation unit 13 checks whether the condition C is a character string (step S79). If the condition C is a character string (YES in step S79), the pattern evaluation unit 13 checks whether flag is 1 (step S80). If flag is 1 (YES in step S80), a wildcard is in effect. In this case, the pattern evaluation unit 13 checks whether the character string W specified as the condition C appears at or after the i-th character of the sentence S (step S81). If the character string W appears (YES in step S81), the pattern evaluation unit 13 determines that the character string W matches the condition C and adds the number of characters in W to i (step S83). If the character string W does not appear (NO in step S81), the pattern evaluation unit 13 outputs information indicating that the pattern does not match and ends the processing.

If flag is not 1 in step S80 (NO in step S80), no wildcard is in effect. In this case, the pattern evaluation unit 13 checks whether the character string W specified as the condition C appears at the i-th character of the sentence S (step S82). If the character string W appears there (YES in step S82), the pattern evaluation unit 13 determines that the character string W matches the condition C and adds the number of characters in W to i (step S83). If the character string W does not appear there (NO in step S82), the pattern evaluation unit 13 outputs information indicating that the pattern does not match and ends the processing.

After assigning the end position of the tag T to i in step S78, or after adding the number of characters in W to i in step S83, the pattern evaluation unit 13 determines whether the condition list List is empty (that is, whether every condition C has been taken out) (step S84). If the condition list List is empty (YES in step S84), the pattern evaluation unit 13 determines that all the conditions C have been satisfied, outputs information indicating that the pattern matches, and ends the processing. If the condition list List is not empty (NO in step S84), the pattern evaluation unit 13 assigns 0 to flag (step S85) and repeats the processing from step S73.

If the condition C is not a character string in step S79 (NO in step S79), the condition C is a wildcard. The pattern evaluation unit 13 therefore assigns 1 to flag (step S86) and repeats the processing from step S73.

In this way, the pattern evaluation unit 13 can output information indicating whether the pattern P received from the pattern synthesis unit 12 matches the sentence S read in step S51 and the set TList of tags attached to the sentence S.
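The matching loop of steps S72 to S86 can be sketched as follows, taking an already split condition list as input. The function name matches and the tuple layouts are hypothetical. The sketch also deviates from a literal reading of the flowchart in two places, both of which are assumptions: a tag whose start position equals i is accepted while the wildcard flag is set (so that a tag at the very start of a sentence can match), and after a wildcard the position i is moved to the end of the string that was found rather than merely increased by its length. Like the flowchart, the sketch is greedy and does not backtrack.

```python
from typing import List, Tuple

Tag = Tuple[str, int, int]     # (tag name, start offset, end offset); assumed layout
Condition = Tuple[str, str]    # ("tag" | "str" | "wild", value); assumed layout


def matches(conditions: List[Condition], sentence: str, tags: List[Tag]) -> bool:
    """Return True if the sentence and its tags satisfy every condition in order."""
    i = 0          # S72: current position in the sentence
    flag = True    # S72: a wildcard is in effect (or we are at the pattern start)
    for kind, value in conditions:                 # S73: next condition C
        if kind == "wild":                         # S79 NO -> S86
            flag = True
            continue
        if kind == "tag":                          # S74 YES
            if flag:                               # S76 (start >= i is an assumption)
                hits = [t for t in tags if t[0] == value and t[1] >= i]
            else:                                  # S77: must start exactly at i
                hits = [t for t in tags if t[0] == value and t[1] == i]
            if not hits:
                return False                       # pattern does not match
            i = min(hits, key=lambda t: t[1])[2]   # S78: jump to the tag's end
        else:                                      # S79 YES: character string W
            if flag:                               # S81: anywhere from i onward
                found = sentence.find(value, i)
            else:                                  # S82: exactly at i
                found = i if sentence.startswith(value, i) else -1
            if found < 0:
                return False
            i = found + len(value)                 # S83 (adjusted, see lead-in)
        flag = False                               # S85
    return True                                    # S84: all conditions satisfied


# Hypothetical English sample; the leading space in " suspect" stands in for adjacency.
sentence = "Police say abcd suspect faces arrest"
start = sentence.index("abcd")
tags = [("person name", start, start + 4)]
rule = [("tag", "person name"), ("str", " suspect"), ("wild", "*"), ("str", "arrest")]
print(matches(rule, sentence, tags))
```

Removing the wildcard condition from rule makes the final "arrest" condition anchor at the position right after " suspect", so the same sentence no longer matches.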

Next, the evaluation value calculation processing will be described. In the evaluation value calculation processing, the pattern evaluation unit 13 calculates pf(p) and df(p) from the document ID and sentence ID pairs extracted in the search processing. For example, the pattern evaluation unit 13 calculates pf(p) by counting the number of document ID and sentence ID pairs, and calculates df(p) by counting the number of distinct document IDs among those pairs. The pattern evaluation unit 13 then calculates the evaluation value based on, for example, Equation 1.
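The two counts can be sketched as below. The function name pf_df and the sample hit list are hypothetical, and the sketch stops at the counts: Equation 1 is defined elsewhere in the specification and is not reproduced here.

```python
from typing import Iterable, Tuple


def pf_df(hits: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (pf, df) for a collection of (document ID, sentence ID) pairs."""
    pairs = set(hits)                       # guard against duplicate pairs
    pf = len(pairs)                         # number of matching sentences
    df = len({doc_id for doc_id, _ in pairs})  # number of distinct documents
    return pf, df


# Hypothetical search result: document 1 has two matching sentences.
hits = [(1, 1), (1, 3), (2, 1), (4, 2)]
print(pf_df(hits))
```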

As long as the logical order of the individual processes is preserved, the pattern synthesis unit 12 and the pattern evaluation unit 13 may perform the processes within the case input process in any order.

Next, the search process will be described. The search process starts when a user or an external program inputs a query (that is, a condition) to the document search unit 14.

When a keyword to be extracted is specified as a query, the document search unit 14 creates, from the tagged text stored in the storage unit (not shown) within the document search unit 14, a set of tagged text whose body contains the specified keyword. The document search unit 14 then empties the target document storage unit 11 of all its data and registers the created set of tagged text in the target document storage unit 11, for example in the format illustrated in FIG. 4.
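The search process can be sketched as follows. Modeling both storage units as dictionaries from document ID to body text, and the name run_search, are assumptions made only for illustration; tags and the FIG. 4 format are omitted.

```python
from typing import Dict, List


def run_search(keyword: str, internal_store: Dict[int, str],
               target_store: Dict[int, str]) -> List[int]:
    """Replace the target store with the documents whose body contains keyword."""
    selected = {doc_id: body for doc_id, body in internal_store.items()
                if keyword in body}
    target_store.clear()           # empty the target document storage first
    target_store.update(selected)  # then register the newly selected set
    return sorted(selected)


# Hypothetical data: the target store initially holds an unrelated document.
internal = {1: "abcd suspect arrest", 2: "weather report", 3: "suspect released"}
target = {9: "stale document"}
print(run_search("suspect", internal, target))
```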

If the case input process has already been carried out (that is, if the patterns serving as extraction rules have already been created), the pattern evaluation unit 13 may start the evaluation step immediately after the document search unit 14 registers the set of tagged text in the target document storage unit 11. This allows changes in the evaluation values caused by changes in the tagged text set in the target document storage unit 11 to be reflected immediately.

According to the present invention, when tagged text and extraction position information are given, the pattern synthesis unit 12 creates extraction rules by combining the word or tag at the position indicated by the extraction position information with the words or tags before and after it. The pattern evaluation unit 13 then extracts, for each tagged text stored in the target document storage unit 11, the document ID and sentence ID of each sentence containing a word or tag that conforms to an extraction rule (a matching sentence), and calculates an evaluation value based on those document IDs and sentence IDs. In doing so, the pattern evaluation unit 13 calculates a higher evaluation value the fewer matching sentences appear within a single tagged text, and a higher evaluation value the more tagged texts contain a matching sentence. Rules for extracting the information the user wants (extraction rules) can therefore be created efficiently.

For example, consider the case where the tagged text illustrated in FIG. 1 and "abcd", the 9th to 13th characters, are input as a case with "abcd" as the designated extraction position. If the user is presumed simply to want to collect person names, the extraction rule should be "$[person name]$". If the user is presumed to want to collect the names of suspects, the extraction rule should be "$[person name]$suspect". Further, if the user is presumed to want to collect person names with the surname "ab", the extraction rule should be "$ab[noun]$". Thus, when only a case is input, the extraction rule that should be chosen differs depending on what the user wants.

In general, the difficulty of the extraction rule creation problem lies in having to infer, from the input examples, what information the user or an external program wants to extract. According to the present invention, however, extraction rules are created from a set of tagged texts and one or more examples, and an evaluation value is calculated for each extraction rule. Extraction rules that meet the user's extraction request can therefore be created while reducing the user's effort.

The document search unit 14 may also extract the tagged texts that satisfy a specified condition and register them in the target document storage unit 11, and the pattern evaluation unit 13 may extract the document ID and sentence ID of matching sentences for each tagged text registered in the target document storage unit 11. In this case, changing the set of tagged texts that the pattern evaluation unit 13 works on makes it possible to customize the evaluation values toward the patterns the user wants, so extraction rules suited to the information the user wants can be obtained efficiently.

The pattern synthesis unit 12 may also select, from the extraction rules it has created, only those patterns whose combination includes one or more tags. In this case, useless patterns are eliminated in advance, so the amount of subsequent computation can be reduced.
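As a minimal sketch of this pruning step, assume each candidate pattern is a list of hypothetical `(kind, value)` pairs where `kind` is `"tag"` or `"word"`. Both the representation and the name `contains_tag` are illustrative assumptions, not the patent's data structures.

```python
from typing import List, Tuple

PatternElem = Tuple[str, str]  # ("tag" or "word", value); hypothetical representation


def contains_tag(pattern: List[PatternElem]) -> bool:
    """True if at least one element of the pattern is a tag."""
    return any(kind == "tag" for kind, _ in pattern)


candidates = [
    [("tag", "person name"), ("word", "suspect")],  # kept: includes a tag
    [("word", "abcd"), ("word", "suspect")],        # dropped: words only
    [("word", "abcd")],                             # dropped: words only
]

kept = [p for p in candidates if contains_tag(p)]
print(kept)
```

Word-only patterns such as a specific person's name generalize poorly to other documents, so discarding them before evaluation avoids searching for them at all.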

The pattern synthesis unit 12 may also create extraction rules by combining only predetermined types of independent words (content words such as nouns) from among the words or tags before and after the word or tag corresponding to the position indicated by the extraction position information. In this case, common words that carry little meaning, such as particles, can be excluded from the patterns.
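The filtering of context words can be sketched as follows. The sketch assumes tokens are hypothetical `(surface, part of speech)` pairs and that nouns are the only allowed type; the text gives nouns only as an example, so `ALLOWED_POS` is an assumption that would be configured in practice.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (surface form, part of speech); hypothetical representation

ALLOWED_POS: FrozenSet[str] = frozenset({"noun"})  # assumed set of independent-word types


def context_words(tokens: List[Token],
                  allowed: FrozenSet[str] = ALLOWED_POS) -> List[str]:
    """Keep only the surface forms whose part of speech is allowed."""
    return [surface for surface, pos in tokens if pos in allowed]


right_context = [("suspect", "noun"), ("was", "auxiliary"), ("arrested", "verb"),
                 ("for", "particle"), ("murder", "noun")]

print(context_words(right_context))
```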

The present invention is described below by way of a specific example, but the scope of the invention is not limited to what is described below. The following description takes as an example the case where the newspaper article data illustrated in FIG. 1 is stored in the document search unit 14 as tagged text and the user wants to create a list of the names of suspects in murder cases.

In the search process, when the user specifies a keyword such as "murder suspect", for example, the document search unit 14 creates, from the tagged texts it stores internally, a set of tagged texts concerning arrests in murder cases and registers that set in the target document storage unit 11. Tagged texts concerning murder cases are thereby stored in the target document storage unit 11.
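The search process can be pictured with a very small sketch. The text does not say how the document search unit 14 matches keywords, so the sketch assumes the simplest possible behavior: a document is selected when its plain text contains every keyword term. The function name `search`, the document IDs, and the sample texts are all hypothetical, and tags are omitted for brevity.

```python
from typing import Dict, List


def search(all_docs: Dict[str, str], keywords: List[str]) -> Dict[str, str]:
    """Select the documents whose text contains every keyword term."""
    return {doc_id: text for doc_id, text in all_docs.items()
            if all(term in text for term in keywords)}


all_docs = {
    "d1": "abcd suspect arrested for murder",
    "d2": "city council approves budget",
    "d3": "murder suspect efgh denies charge",
}

# The selected documents play the role of the target document storage unit 11.
target_store = search(all_docs, "murder suspect".split())
print(sorted(target_store))
```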

Next, in the example input process, when the user inputs, for example, the tagged text illustrated in FIG. 1 and the designated extraction position from the 9th character to the 13th character to the pattern synthesis unit 12, the pattern synthesis unit 12 starts the synthesis step. Here, it is assumed that list A, list RP, list LP, list RW, and list LW illustrated in FIG. 14 are created by the processing of steps S10, S20, and S30. Then, in step S40, the pattern synthesis unit 12 combines these patterns to create the patterns illustrated in FIG. 5 and notifies the pattern evaluation unit 13 of them.
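The combination in step S40 can be sketched as a Cartesian product over the lists. The contents of the lists are not reproduced in this section, so the sketch rests on a guess about their roles: A holds expressions for the extraction target, LP and RP hold patterns immediately to its left and right, and LW and RW hold more distant independent words joined through a `*` wildcard. The function name `synthesize` and the sample list contents are hypothetical.

```python
from itertools import product
from typing import List


def synthesize(A: List[str], LP: List[str], RP: List[str],
               LW: List[str], RW: List[str]) -> List[str]:
    """Combine a target expression with optional left/right context elements."""
    rules = []
    # The empty string stands for "this context element is not used".
    for a, lp, rp, lw, rw in product(A, [""] + LP, [""] + RP, [""] + LW, [""] + RW):
        left = (lw + "*" if lw else "") + lp
        right = rp + ("*" + rw if rw else "")
        rules.append(f"{left}${a}${right}")
    return rules


rules = synthesize(A=["[person name]", "abcd"], LP=[], RP=["suspect"],
                   LW=[], RW=["murder"])
for rule in rules:
    print(rule)
```

With two target expressions and one optional element on each of two sides, the sketch yields eight candidates, including `$[person name]$suspect*murder`. The candidate count grows multiplicatively with the list sizes, which is one reason pruning candidates before evaluation is useful.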

Next, in the evaluation step, the pattern evaluation unit 13 performs search processing and evaluation value calculation processing on each pattern it has been notified of, and obtains the patterns illustrated in FIG. 5 together with the evaluation values illustrated in FIG. 6. In this example, the target document storage unit 11 stores a set of tagged texts created from the keyword "murder suspect", so many of the sentences are likely to concern arrests in murder cases. Consequently, patterns such as "$[person name]$suspect", "$[person name]$suspect*murder", and "$[person name]$suspect*killing" receive high evaluation values.
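The search part of the evaluation step, which collects the document ID and sentence ID of every matching sentence, can be sketched as below. The sketch assumes each document is a list of sentences, each sentence is a list of hypothetical `(surface, tag)` tokens, and the sentence ID is simply the sentence's index. The matcher `person_then_suspect` is a hard-coded stand-in for the rule `$[person name]$suspect`, not a general rule engine.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]   # (surface form, tag); hypothetical representation
Sentence = List[Token]


def matching_positions(docs: Dict[str, List[Sentence]],
                       matches: Callable[[Sentence], bool]) -> List[Tuple[str, int]]:
    """Return (document ID, sentence ID) for every sentence the matcher accepts."""
    return [(doc_id, sent_id)
            for doc_id, sentences in docs.items()
            for sent_id, sentence in enumerate(sentences)
            if matches(sentence)]


def person_then_suspect(sentence: Sentence) -> bool:
    """True if a person-name token is immediately followed by the word 'suspect'."""
    return any(a[1] == "person name" and b[0] == "suspect"
               for a, b in zip(sentence, sentence[1:]))


docs = {
    "d1": [[("abcd", "person name"), ("suspect", "noun"), ("arrested", "verb")],
           [("efgh", "person name"), ("injured", "verb")]],
    "d2": [[("police", "noun"), ("arrested", "verb")],
           [("ijkl", "person name"), ("suspect", "noun")]],
}

positions = matching_positions(docs, person_then_suspect)
print(positions)
```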

The evaluation value in the present invention is higher for a pattern that appears in more documents and occurs less frequently within each of them. This means that a pattern that appears exactly once in every tagged text in the target document storage unit 11 is rated highly. Keywords specific to a particular incident, such as "Nara" or "ab" (where ab stands for two kanji characters representing a surname), cannot be said to appear in many documents, so their evaluation values are low. Furthermore, a pattern that matches any person name, such as "$[person name]$", matches victim names and the like as well as suspect names and therefore occurs frequently, so its evaluation value is also low. This is why patterns such as "$[person name]$suspect", "$[person name]$suspect*murder", and "$[person name]$suspect*killing" receive high evaluation values.
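The behavior described above can be illustrated with a toy score. This is not Formula 1 of the patent, which is not reproduced in this section. It is a hypothetical function chosen only because it has the two stated properties: it rises as more documents contain a match and falls as matches per document increase. The function name `evaluation_value` and the sample data are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def evaluation_value(matches: Iterable[Tuple[str, int]], n_docs: int) -> float:
    """Toy score: document coverage times the inverse of matches per matching document."""
    per_doc = Counter(doc_id for doc_id, _ in set(matches))
    if not per_doc or n_docs <= 0:
        return 0.0
    coverage = len(per_doc) / n_docs                 # rises with more matching documents
    sparsity = len(per_doc) / sum(per_doc.values())  # rises with fewer matches per document
    return coverage * sparsity


doc_ids = ["d1", "d2", "d3", "d4"]
suspect = [(d, 0) for d in doc_ids]                     # once in every document
any_name = [(d, s) for d in doc_ids for s in range(3)]  # three sentences in every document
nara = [("d1", 0)]                                      # one document only

for name, m in [("suspect", suspect), ("any_name", any_name), ("nara", nara)]:
    print(name, round(evaluation_value(m, len(doc_ids)), 3))
```

Under this toy score the pattern that appears once in every document scores highest, while both the over-general pattern and the incident-specific keyword score lower, matching the ordering described in the text.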

In addition, the present invention can be used as a search system that implements list-type search, in which the information the user wants is picked out of text phrase by phrase and output as a list. It can also be used as a text mining system that visualizes the extracted values in graphs and the like.

Next, the minimum configuration of the present invention is described. FIG. 15 is a block diagram showing the minimum configuration of the present invention. The extraction rule creation system according to the present invention comprises: tagged text storage means 81 (for example, the target document storage unit 11) that stores tagged text, which is a document including a set of tags, each tag being information added to an arbitrary position in a character string and representing position information (for example, a start position and an end position) indicating the position of the character string to which the tag is added and attribute information (for example, noun or person name) indicating the attribute of the word corresponding to that position; extraction rule creation means 82 (for example, the pattern synthesis unit 12) that, when tagged text and character string position information (for example, extraction position information) indicating the position of a character string in that tagged text are given, creates an extraction rule (for example, a pattern), which is a rule for extracting information from tagged text, by combining the word or tag corresponding to the position indicated by the character string position information with the words or tags before and after that word or tag; matching sentence position information extraction means 83 (for example, the pattern evaluation unit 13) that extracts, for each tagged text stored in the tagged text storage means 81, matching sentence position information (for example, a document ID and a sentence ID), which is information indicating the position of a matching sentence including a word or tag that matches the extraction rule; and evaluation value calculation means 84 (for example, the pattern evaluation unit 13) that calculates, based on the matching sentence position information, an evaluation value, which is a value obtained by evaluating the extraction rule.

The evaluation value calculation means 84 calculates a higher evaluation value the fewer matching sentences appear within one tagged text, and a higher evaluation value the more tagged texts contain matching sentences (for example, it calculates the evaluation value based on Formula 1).

With such a configuration, rules for extracting the information the user wants can be created efficiently.

The above embodiment also discloses extraction rule creation systems having the following configurations.

(1) An extraction rule creation system comprising: tagged text storage means (for example, the target document storage unit 11) that stores tagged text, which is a document including a set of tags, each tag being information added to an arbitrary position in a character string and representing position information (for example, a start position and an end position) indicating the position of the character string to which the tag is added and attribute information (for example, noun or person name) indicating the attribute of the word corresponding to that position; extraction rule creation means (for example, the pattern synthesis unit 12) that, when tagged text and character string position information (for example, extraction position information) indicating the position of a character string in that tagged text are given, creates an extraction rule (for example, a pattern), which is a rule for extracting information from tagged text, by combining the word or tag corresponding to the position indicated by the character string position information with the words or tags before and after that word or tag; matching sentence position information extraction means (for example, the pattern evaluation unit 13) that extracts, for each tagged text stored in the tagged text storage means, matching sentence position information (for example, a document ID and a sentence ID), which is information indicating the position of a matching sentence including a word or tag that matches the extraction rule; and evaluation value calculation means (for example, the pattern evaluation unit 13) that calculates, based on the matching sentence position information, an evaluation value, which is a value obtained by evaluating the extraction rule, wherein the evaluation value calculation means calculates a higher evaluation value the fewer matching sentences appear within one tagged text, and a higher evaluation value the more tagged texts contain matching sentences (for example, it calculates the evaluation value based on Formula 1).

(2) An extraction rule creation system comprising tagged text registration means (for example, the document search unit 14) that extracts, from a plurality of tagged texts, the tagged texts that satisfy a specified condition and registers the extracted tagged texts in the tagged text storage means, wherein the matching sentence position information extraction means extracts matching sentence position information for each tagged text registered by the tagged text registration means.

(3) An extraction rule creation system in which the extraction rule creation means selects, from the extraction rules it has created, combinations that include one or more tags.

(4) An extraction rule creation system in which the extraction rule creation means creates extraction rules by combining predetermined types of independent words (for example, nouns) from among the words or tags before and after the word or tag corresponding to the position indicated by the character string position information.

(5) An extraction rule creation system in which the extraction rule creation means creates extraction rules by placing a wildcard (for example, "*") between the word or tag corresponding to the position indicated by the character string position information and an independent word of a predetermined type.
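The wildcard in configuration (5) can be sketched as follows. The text does not define the exact span of `*`, so the sketch assumes it stands for zero or more tokens within the same sentence. Tokens are hypothetical `(surface, tag)` pairs, and `matches_with_wildcard` is an illustrative stand-in for a rule such as `$[person name]$*murder`, not a general rule engine.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, tag); hypothetical representation


def matches_with_wildcard(sentence: List[Token], target_tag: str, far_word: str) -> bool:
    """True if a token tagged target_tag is followed, after any number of
    tokens in the same sentence, by the word far_word."""
    for i, (_, tag) in enumerate(sentence):
        if tag == target_tag and any(s == far_word for s, _ in sentence[i + 1:]):
            return True
    return False


arrest = [("abcd", "person name"), ("suspect", "noun"),
          ("arrested", "verb"), ("murder", "noun")]
reversed_order = [("murder", "noun"), ("abcd", "person name")]

print(matches_with_wildcard(arrest, "person name", "murder"))
print(matches_with_wildcard(reversed_order, "person name", "murder"))
```

The first sentence matches because "murder" follows the person name with other tokens in between; the second does not, because the word precedes the name.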

The present invention is suitably applied to an extraction rule creation system that creates extraction rules for extracting information from documents.

11 Target document storage unit
12 Pattern synthesis unit
13 Pattern evaluation unit
14 Document search unit

Claims

An extraction rule creation system comprising:
tagged text storage means for storing tagged text, which is a document including a set of tags, each tag being information added to an arbitrary position in a character string and representing position information indicating the position of the character string to which the tag is added and attribute information indicating an attribute of the word corresponding to that position;
extraction rule creation means for creating, when tagged text and character string position information indicating the position of a character string in the tagged text are given, an extraction rule, which is a rule for extracting information from the tagged text, by combining the word or tag corresponding to the position indicated by the character string position information with words or tags before and after that word or tag;
matching sentence position information extraction means for extracting, for each tagged text stored in the tagged text storage means, matching sentence position information, which is information indicating the position of a matching sentence including a word or tag that matches the extraction rule; and
evaluation value calculation means for calculating, based on the matching sentence position information, an evaluation value, which is a value obtained by evaluating the extraction rule,
wherein the evaluation value calculation means calculates a higher evaluation value as fewer matching sentences appear in one tagged text, and calculates a higher evaluation value as matching sentences appear in more tagged texts.

The extraction rule creation system according to claim 1, further comprising tagged text registration means for extracting tagged text that satisfies a specified condition from a plurality of tagged texts and registering the extracted tagged text in the tagged text storage means,
wherein the matching sentence position information extraction means extracts matching sentence position information for each tagged text registered by the tagged text registration means.

The extraction rule creation system according to claim 1 or 2, wherein the extraction rule creation means selects a combination including one or more tags from the created extraction rules.

The extraction rule creation system according to any one of claims 1 to 3, wherein the extraction rule creation means creates an extraction rule by combining predetermined types of independent words from among the words or tags before and after the word or tag corresponding to the position indicated by the character string position information.

The extraction rule creation system according to claim 4, wherein the extraction rule creation means creates an extraction rule by placing a wildcard between the word or tag corresponding to the position indicated by the character string position information and an independent word of a predetermined type.

An extraction rule creation method comprising:
an extraction rule creation step of creating, when tagged text and character string position information indicating the position of a character string in the tagged text are given, an extraction rule, which is a rule for extracting information from the tagged text, by combining the word or tag corresponding to the position indicated by the character string position information with words or tags before and after that word or tag, the tagged text being a document including a set of tags, each tag being information added to an arbitrary position in a character string and representing position information indicating the position of the character string to which the tag is added and attribute information indicating an attribute of the word corresponding to that position;
a matching sentence position information extraction step of extracting, for each tagged text stored in tagged text storage means, matching sentence position information, which is information indicating the position of a matching sentence including a word or tag that matches the extraction rule; and
an evaluation value calculation step of calculating, based on the matching sentence position information, an evaluation value, which is a value obtained by evaluating the extraction rule,
wherein, in the evaluation value calculation step, a higher evaluation value is calculated as fewer matching sentences appear in one tagged text, and a higher evaluation value is calculated as matching sentences appear in more tagged texts.

The extraction rule creation method according to claim 6, further comprising a tagged text registration step of extracting tagged text that satisfies a specified condition from a plurality of tagged texts and registering the extracted tagged text in the tagged text storage means,
wherein, in the matching sentence position information extraction step, matching sentence position information is extracted for each tagged text registered in the tagged text registration step.

An extraction rule creation program applied to a computer having tagged text storage means for storing tagged text, which is a document including a set of tags, each tag being information added to an arbitrary position in a character string and representing position information indicating the position of the character string to which the tag is added and attribute information indicating an attribute of the word corresponding to that position,
the program causing the computer to execute:
an extraction rule creation process of creating, when tagged text and character string position information indicating the position of a character string in the tagged text are given, an extraction rule, which is a rule for extracting information from the tagged text, by combining the word or tag corresponding to the position indicated by the character string position information with words or tags before and after that word or tag;
a matching sentence position information extraction process of extracting, for each tagged text stored in the tagged text storage means, matching sentence position information, which is information indicating the position of a matching sentence including a word or tag that matches the extraction rule; and
an evaluation value calculation process of calculating, based on the matching sentence position information, an evaluation value, which is a value obtained by evaluating the extraction rule,
wherein, in the evaluation value calculation process, a higher evaluation value is calculated as fewer matching sentences appear in one tagged text, and a higher evaluation value is calculated as matching sentences appear in more tagged texts.

The extraction rule creation program according to claim 8, further causing the computer to execute
a tagged text registration process of extracting tagged text that satisfies a specified condition from a plurality of tagged texts and registering the extracted tagged text in the tagged text storage means,
wherein, in the matching sentence position information extraction process, matching sentence position information is extracted for each tagged text registered in the tagged text registration process.