JP4489034B2

JP4489034B2 - Structured document processing apparatus, structured document processing method, and structured document processing program

Info

Publication number: JP4489034B2
Application number: JP2006045808A
Authority: JP
Inventors: 昭子村井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2006-02-22
Filing date: 2006-02-22
Publication date: 2010-06-23
Anticipated expiration: 2026-02-22
Also published as: JP2007226453A

Description

本発明は、構造化された文書を処理する構造化文書処理装置、構造化文書処理方法および構造化文書処理プログラムに関するものである。 The present invention relates to a structured document processing apparatus, a structured document processing method, and a structured document processing program for processing a structured document.

In recent years, documents in which "tags" indicating the document structure are added to the plain text (structured documents) have come into wide use, because they make document data easier to manage and process on a computer. Conventions for describing such structured documents, such as SGML (Standard Generalized Markup Language) and XML (eXtensible Markup Language), are being standardized.

In a structured document written according to such a convention, the text delimited by tags can serve as the unit of search. This enables searches that exploit the structure of the document data, such as finding document data in which the text tagged as "heading" contains a specific keyword (see, for example, Patent Document 1).

Using such a structured document convention, document data that contains a user-specified keyword within a designated structure can therefore be retrieved from a database containing a plurality of structured documents, which makes it possible to find documents better suited to the user's purpose.

An inverted file method is also used to search a large volume of document data for documents containing a keyword (see, for example, Patent Document 2). The inverted file method is intended to speed up search processing. It has a drawback, however: some search keywords appear very frequently across all documents, and the cost of the search grows with that frequency. The cost of reading the index information from an external storage device is especially pronounced.

Patent Document 1: JP 2000-207409 A
Patent Document 2: JP 2001-265773 A

When a search that designates a structure is performed on a high-frequency keyword, conventional methods, including the inverted file method, incur a high cost for reading the index information. As a result, the processing cost is high even when the number of actual hits is small.

One conceivable approach is to subdivide the index further by structure, which would reduce the number of entries processed in a structure-designated search and thereby lower the processing cost. This is not realistic, however: the index management cost rises with the complexity of the structure, the storage efficiency of the index falls, and the resources consumed are large relative to the benefit. Subdividing by structure is also expected to degrade performance when no structure, or more than one structure, is designated.

The present invention has been made in view of the above, and its object is to provide a structured document processing apparatus, a structured document processing method, and a structured document processing program that can retrieve documents suited to the purpose while keeping the search processing cost low.

One aspect of the present invention relates to a structured document processing apparatus comprising: structured document holding means for holding a plurality of structured documents expressed in a tree structure; character string frequency holding means for holding a character string included in the structured documents held by the structured document holding means in association with a character string frequency, which is the frequency of appearance of that character string in the plurality of structured documents; index information holding means for holding, for each frequent character string whose character string frequency among the character strings held in the character string frequency holding means is equal to or higher than a preset threshold, index information for identifying the character string to be specified by an index key, in association with that index key, the index key consisting of the frequent character string and a structure ID indicating the position at which the frequent character string appears in the tree structure of the structured document; frequent character string extraction means for extracting, from the character strings held by the character string frequency holding means, frequent character strings whose character string frequency is equal to or higher than a predetermined threshold; and structure ID specifying means for specifying the structure ID of each frequent character string extracted by the frequent character string extraction means. The structure ID specifying means assigns, to a plurality of structure IDs of the frequent character string that meet a predetermined condition, group identification information identifying a group that includes those structure IDs. The index information holding means holds the structure ID specified by the structure ID specifying means as the index key and, for frequent character strings included in the group, holds the group identification information as the structure ID of the index key.

According to the present invention, documents matching a query can be retrieved at a low search processing cost even for frequent character strings.

以下に、本発明にかかる構造化文書処理装置、構造化文書処理方法および構造化文書処理プログラムの実施の形態を図面に基づいて詳細に説明する。なお、この実施の形態によりこの発明が限定されるものではない。 Embodiments of a structured document processing apparatus, a structured document processing method, and a structured document processing program according to the present invention will be described below in detail with reference to the drawings. Note that the present invention is not limited to the embodiments.

FIG. 1 is a block diagram showing the overall configuration of the structured document processing apparatus 10 according to the embodiment. The structured document processing apparatus 10 includes a structured document acquisition unit 102, a character string extraction unit 104, a character string frequency calculation unit 106, a frequent character string extraction unit 108, a structure ID specifying unit 110, a structure ID frequency calculation unit 112, an index registration unit 114, a structured document database (DB) 120, a character string frequency DB 122, a structure ID frequency DB 124, an index DB 126, a search condition acquisition unit 130, a keyword character string extraction unit 132, a search unit 134, a structured document extraction unit 136, and a structured document output unit 138.

The structured document acquisition unit 102 acquires a structured document from outside the apparatus. The structured document is described first. FIG. 2 is a diagram for explaining a structured document; it shows an example written with XML as the structured document convention. The structured document is not limited to XML and may follow another convention such as SGML.

As shown in FIG. 2, XML adds "tags" to the plain text. In the example of FIG. 2, "<doc>", "</doc>", "<title>", and "<p>" are tags. These tags give the document data its structure. In XML, a tag is written by enclosing a name that represents its content between "<" and ">".

A tag written between "<" and ">" is a start tag, and a tag written between "</" and ">" is an end tag. Enclosing text between a start tag and an end tag gives that text a structure. The text between a start tag and an end tag is called a text element.

In the example of FIG. 2, the text 「XMLデータベース」 ("XML database") is enclosed between "<title>" and "</title>", which gives this text the structure "title".

図３は、図２に示す構造化文書の構造を概念的に示す図である。このように、図２に示す構造化文書は木構造により表現される。タグによって区分された各テキスト要素が木構造のノードとして表されている。さらに、各タグもノードとして表されている。そして、各ノードには、各ノードを識別する要素ＩＤが付与されている。 FIG. 3 is a diagram conceptually showing the structure of the structured document shown in FIG. In this way, the structured document shown in FIG. 2 is represented by a tree structure. Each text element divided by the tag is represented as a tree-structured node. Furthermore, each tag is also represented as a node. Each node is given an element ID for identifying each node.

例えば、タグ「ｄｏｃ」には、要素ＩＤ「１」が付与され、「ｔｉｔｌｅ」には、要素ＩＤ「２」が付与されている。また、テキスト要素「ＸＭＬデータベース」には、要素ＩＤ「３」が付与されている。このように、要素ＩＤは、構造化文書中の各要素を識別するためのＩＤである。 For example, an element ID “1” is assigned to the tag “doc”, and an element ID “2” is assigned to “title”. The element ID “3” is assigned to the text element “XML database”. Thus, the element ID is an ID for identifying each element in the structured document.

FIG. 4 is a diagram for explaining the structure ID in a structured document. Each tag element and each text element occupies some position in the tree structure, and the structure ID identifies that position. In other words, the structure ID is information that identifies the path from "root".

For example, the structure ID "1" is assigned to the tag "doc", and the structure ID "2" is assigned to "title". The three "p" (paragraph) elements contained in "chapter" all have the same path from "root", so the same structure ID "5" is assigned to all three.

Suppose that a structured document other than the one shown in FIG. 4 likewise has "doc" as a branch of "root" and "title" as a branch of "doc". That "title" is then given the same structure ID as the "title" in the structured document of FIG. 4, namely structure ID "2". The structure ID thus takes the same value wherever the structural position within a structured document is the same, regardless of which structured document it belongs to.
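The path-based numbering described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual procedure: the class `StructureIdRegistry`, the function `assign_structure_ids`, the `#text` marker for text nodes, and the first-seen numbering order are all assumptions introduced here, and the resulting numbers need not match FIG. 4.

```python
import xml.etree.ElementTree as ET


class StructureIdRegistry:
    """Hypothetical registry: one structure ID per distinct path from the root.

    The registry is shared across documents, so identical paths in different
    documents receive the same structure ID.
    """

    def __init__(self):
        self._ids = {}  # path tuple -> structure ID

    def id_for(self, path):
        # IDs are handed out in first-seen order, starting at 1 (an assumption).
        if path not in self._ids:
            self._ids[path] = len(self._ids) + 1
        return self._ids[path]


def assign_structure_ids(xml_text, registry):
    """Return (path, structure_id) pairs for tag nodes and text nodes, in document order."""
    root = ET.fromstring(xml_text)
    out = []

    def walk(elem, parent_path):
        path = parent_path + (elem.tag,)
        out.append(("/".join(path), registry.id_for(path)))
        # Only text directly after the start tag is considered; tail text is ignored.
        if elem.text and elem.text.strip():
            text_path = path + ("#text",)
            out.append(("/".join(text_path), registry.id_for(text_path)))
        for child in elem:
            walk(child, path)

    walk(root, ())
    return out
```

Passing the same registry to every document is what makes a "title" under "doc" receive the same structure ID in each of them, and sibling "p" elements under one "chapter" share a single ID because their paths are identical.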

図５は、構造化文書ＤＢ１２０のデータ構成を模式的に示す図である。このように、構造化文書ＤＢ１２０は、構造化文書と、構造化文書を識別する文書ＩＤとを対応付けて保持している。 FIG. 5 is a diagram schematically showing the data structure of the structured document DB 120. As described above, the structured document DB 120 holds the structured document in association with the document ID for identifying the structured document.

図１の文字列抽出部１０４は、構造化文書ＤＢ１２０に登録されている構造化文書から予め設定されている条件にしたがい文字列を抽出する。そして、抽出した文字列の数を文字列頻度ＤＢ１２２に登録する。 The character string extraction unit 104 in FIG. 1 extracts a character string from a structured document registered in the structured document DB 120 according to a preset condition. Then, the number of extracted character strings is registered in the character string frequency DB 122.

図６は、文字列抽出部１０４が抽出する文字列を説明するための図である。図６は、１つのテキスト要素を示している。文字列は、このテキスト要素内に含まれる一部の文字列である。例えば、「データ」、「ベース」などが文字列として抽出される。予め設定されている条件とは、例えば、英数文字およびカタカナは３文字単位で抽出し、ひらがなは２文字単位で抽出するなどの条件である。 FIG. 6 is a diagram for explaining a character string extracted by the character string extraction unit 104. FIG. 6 shows one text element. The character string is a partial character string included in the text element. For example, “data”, “base”, and the like are extracted as character strings. The preset condition is, for example, a condition that alphanumeric characters and katakana are extracted in units of three characters and hiragana is extracted in units of two characters.

図７は、文字列抽出の処理を説明するための図である。このように、条件に従い所定の文字単位で文字列を抽出する際に、１文字ずつずらして抽出してもよい。これにより文字列の抽出漏れを防止することができる。なお、文字列抽出の方法は、本実施の形態に限定されるものではなく、例えば日本語形態素解析を利用した単語単位で文字列を抽出してもよい。 FIG. 7 is a diagram for explaining the character string extraction process. As described above, when extracting a character string in a predetermined character unit according to the condition, it may be extracted by shifting one character at a time. Thereby, omission of extraction of a character string can be prevented. Note that the method of extracting a character string is not limited to the present embodiment, and for example, a character string may be extracted in units of words using Japanese morphological analysis.
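As an illustration of the example condition (three-character units for alphanumerics and katakana, two-character units for hiragana, shifted one character at a time), a minimal sketch might look like this. The name `extract_strings`, the exact character ranges, the restriction to ASCII alphanumerics, and the decision to ignore all other scripts are assumptions made for the example.

```python
import re

# (pattern for a run of one script class, unit length) pairs.
# The unit lengths follow the example condition in the text; the character
# ranges and the omission of other scripts (e.g. kanji) are assumptions.
RUN_PATTERNS = [
    (re.compile(r"[A-Za-z0-9]+"), 3),             # ASCII alphanumerics
    (re.compile(r"[\u30A1-\u30FA\u30FC]+"), 3),   # katakana and the long-vowel mark
    (re.compile(r"[\u3041-\u3096]+"), 2),         # hiragana
]


def extract_strings(text):
    """Extract fixed-length character strings, sliding one character at a time."""
    found = []
    for pattern, n in RUN_PATTERNS:
        for match in pattern.finditer(text):
            run = match.group()
            # A run shorter than n yields nothing; otherwise every window of
            # length n inside the run is emitted.
            found.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return found
```

For the text element 「データベース」 this yields 「データ」, 「ータベ」, 「タベー」, and 「ベース」, so a keyword that starts at any character position still has a matching extracted string.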

The character string frequency calculation unit 106 shown in FIG. 1 calculates the frequency of appearance of each character string identified by the character string extraction unit 104, that is, the character string frequency. The character string frequency is the number of times the target character string appears in the structured documents registered in the structured document DB 120. When a plurality of structured documents are registered in the structured document DB 120, it is the number of appearances across all of them.

For example, the structured document shown in FIG. 2 contains four occurrences of the character string 「データ」 ("data"), so the character string frequency within that document is 4. The character string frequency calculation unit 106 then adds the frequencies of 「データ」 in the other structured documents to obtain the character string frequency of 「データ」 over all structured documents registered in the structured document DB 120.

文字列頻度算出部１０６は、各文字列の文字列頻度を、文字列に対応付けて文字列頻度ＤＢ１２２に登録する。 The character string frequency calculation unit 106 registers the character string frequency of each character string in the character string frequency DB 122 in association with the character string.
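The summation over documents can be sketched with a counter. The function name `string_frequencies` and the input shape (a mapping from document ID to the list of strings extracted from that document) are assumptions for illustration; the in-memory counter merely stands in for the character string frequency DB 122.

```python
from collections import Counter


def string_frequencies(extracted_per_document):
    """Sum per-document occurrence counts into corpus-wide character string frequencies.

    extracted_per_document: {document ID: [extracted string, ...]} (hypothetical shape).
    """
    totals = Counter()
    for strings in extracted_per_document.values():
        # Counter.update adds one count per occurrence in the list.
        totals.update(strings)
    return totals
```

If one document contributes four occurrences of 「データ」 and another contributes two, the registered character string frequency for 「データ」 is six.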

FIG. 8 is a diagram schematically showing the data structure of the character string frequency DB 122. It holds each character string extracted by the character string extraction unit 104 in association with the character string frequency calculated for it by the character string frequency calculation unit 106.

図１に示す頻出文字列抽出部１０８は、文字列頻度ＤＢ１２２に登録されている文字列頻度に基づいて、頻出文字列を抽出する。ここで、頻出文字列とは、文字列頻度が予め設定された閾値よりも大きい値となる文字列のことである。閾値は例えば、「１００００」と設定する。このように閾値は絶対値で設定する。 The frequent character string extraction unit 108 illustrated in FIG. 1 extracts a frequent character string based on the character string frequency registered in the character string frequency DB 122. Here, the frequent character string is a character string whose character string frequency is larger than a preset threshold value. For example, the threshold is set to “10000”. Thus, the threshold value is set as an absolute value.

As another example, the threshold may be set as a relative value based on the total number of character strings contained in all structured documents registered in the structured document DB 120. For instance, taking that total as 1, a proportion such as "0.8" of occurrences of a given character string may be set as the relative threshold.
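Both ways of setting the threshold can be sketched in one function. The name `frequent_strings` and its parameters are hypothetical, and the strict "greater than" comparison follows the embodiment's wording (the summary of the invention instead says "equal to or higher than").

```python
def frequent_strings(freqs, absolute=None, relative=None):
    """Return the set of strings whose frequency exceeds the threshold.

    freqs:    {string: character string frequency}
    absolute: threshold as a raw count, e.g. 10000
    relative: threshold as a share of all string occurrences, e.g. 0.8
    Exactly one of the two must be given (a design choice of this sketch).
    """
    if (absolute is None) == (relative is None):
        raise ValueError("specify exactly one of absolute or relative")
    if absolute is not None:
        limit = absolute
    else:
        # The relative threshold is scaled by the total number of occurrences.
        limit = relative * sum(freqs.values())
    return {s for s, n in freqs.items() if n > limit}
```

A relative threshold adapts automatically as documents are added, whereas an absolute threshold such as 10000 has to be retuned when the collection grows.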

構造ＩＤ特定部１１０は、構造化文書ＤＢ１２０に保持されている構造化文書に含まれる各頻出文字列の、構造化文書における配置位置、すなわち構造ＩＤを特定する。そして、構造ＩＤを頻出文字列に対応付けて構造ＩＤ頻度ＤＢ１２４に登録する。 The structure ID specifying unit 110 specifies the arrangement position in the structured document, that is, the structure ID, of each frequent character string included in the structured document held in the structured document DB 120. Then, the structure ID is registered in the structure ID frequency DB 124 in association with the frequent character string.

The structure ID frequency calculation unit 112 calculates the frequency of appearance of each structure ID specified by the structure ID specifying unit 110, that is, the structure ID frequency. The structure ID frequency is the number of occurrences of a given frequent character string located at a given structure ID in the structured documents registered in the structured document DB 120.

For example, if 「データ」 ("data") is a frequent character string in the tree structure shown in FIG. 4, the structure ID frequency of 「データ」 at structure ID "5" is "3", and its structure ID frequency at structure ID "3" is "1". The structure ID frequencies for the same structure ID in the other structured documents are then added to obtain the final structure ID frequency.
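The per-structure counting for frequent character strings can be sketched as a nested counter. The function name `structure_id_frequencies` and the input shape, a flat sequence of (string, structure ID) occurrences gathered from all documents, are assumptions for the example.

```python
from collections import Counter


def structure_id_frequencies(occurrences, frequent):
    """Count, for each frequent string, how often it occurs at each structure ID.

    occurrences: iterable of (string, structure ID) pairs (hypothetical shape)
    frequent:    set of frequent character strings
    Returns {string: Counter({structure ID: structure ID frequency})}.
    """
    table = {}
    for string, structure_id in occurrences:
        if string in frequent:  # non-frequent strings are not tracked per structure
            table.setdefault(string, Counter())[structure_id] += 1
    return table
```

With one occurrence of 「データ」 at structure ID 3 and three at structure ID 5, the result reproduces the frequencies "1" and "3" given in the example above.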

図９は、構造ＩＤ頻度ＤＢ１２４のデータ構成を模式的に示す図である。頻出文字列抽出部１０８により抽出された頻出文字列と、当該頻出文字列が出現する構造ＩＤとが対応付けられている。さらに構造ＩＤには、構造ＩＤ頻度が対応付けられている。 FIG. 9 is a diagram schematically illustrating the data configuration of the structure ID frequency DB 124. The frequent character string extracted by the frequent character string extraction unit 108 is associated with the structure ID in which the frequent character string appears. Further, the structure ID frequency is associated with the structure ID.

図１に示す索引登録部１１４は、文字列頻度ＤＢ１２２および構造ＩＤ頻度ＤＢ１２４に登録されている情報に基づいて、索引キーと索引情報を索引ＤＢ１２６に登録する。具体的には、文字列抽出部１０４が抽出した文字列を索引キーとして登録する。索引情報には、索引キーである文字列を含む構造化文書の文書ＩＤ、文字列を含むテキスト要素の要素ＩＤおよびオフセットが含まれる。ここで、オフセットとは、テキスト要素内における文字列の位置を示す情報である。オフセットは、要素内位置情報に相当する。なお、頻出文字列については、文字列と構造ＩＤを索引キーとして登録し、各索引キーに索引情報を対応付けて登録する。 The index registration unit 114 illustrated in FIG. 1 registers an index key and index information in the index DB 126 based on information registered in the character string frequency DB 122 and the structure ID frequency DB 124. Specifically, the character string extracted by the character string extraction unit 104 is registered as an index key. The index information includes the document ID of the structured document including the character string that is the index key, the element ID of the text element including the character string, and the offset. Here, the offset is information indicating the position of the character string in the text element. The offset corresponds to the in-element position information. For frequent character strings, a character string and a structure ID are registered as index keys, and index information is registered in association with each index key.

このように、索引情報は文字列を特定するための情報なので、索引情報を利用することにより、各文字列を含むテキスト要素、テキスト要素における文字列の位置、文字列を含む構造化文書を特定することができる。さらに、頻出文字列については、文字列と構造ＩＤを検索キーとすることにより、頻出文字列における検索処理コストを抑えることができる。 In this way, the index information is information for specifying a character string. By using the index information, the text element including each character string, the position of the character string in the text element, and the structured document including the character string are specified. can do. Furthermore, for a frequent character string, the search processing cost in the frequent character string can be suppressed by using the character string and the structure ID as a search key.

FIG. 10 is a diagram for explaining the offset. As shown in FIG. 10, in the text element 「データベース・・・」 ("database..."), offset positions are assigned in order, with the first character 「デ」 at offset position "1". The character string 「データ」 consists of the three characters starting at the first character of the text element, so its offset is "1". The first character 「ー」 of the character string 「ータベ」 is the second character of the text element, so the offset of 「ータベ」 is "2".
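The 1-based offset convention can be shown in a few lines. The helper `ngram_offsets` is a hypothetical name, and a fixed unit length `n` is assumed for simplicity.

```python
def ngram_offsets(text, n):
    """Pair every length-n substring of text with its 1-based offset in the text element."""
    # Python indexes from 0, so the offset is the slice start plus one.
    return [(text[i:i + n], i + 1) for i in range(len(text) - n + 1)]
```

Calling `ngram_offsets("データベース", 3)` pairs 「データ」 with offset 1 and 「ータベ」 with offset 2, matching the description of FIG. 10.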

索引ＤＢ１２６は、索引テーブルを保持している。より具体的には、文書ＩＤごとのテーブルからなる文書単位索引テーブルと、構造化文書によらずすべての文字列を含む文字列単位索引テーブルとを保持している。 The index DB 126 holds an index table. More specifically, a document unit index table including a table for each document ID and a character string unit index table including all character strings regardless of the structured document are held.

FIG. 11 is a diagram schematically showing the data structure of the document unit index table 127 that the index DB 126 holds for each document ID. The document unit index table holds each character string serving as an index key in association with index information, which contains a document ID, an element ID, and an offset. In the document unit index tables, a character string that occurs in documents with different document IDs is registered as a separate index key for each document.

FIG. 12 is a diagram schematically showing the data structure of the character string unit index table 128 that the index DB 126 holds for each character string. Like the document unit index table 127 described with reference to FIG. 11, the character string unit index table 128 holds each character string serving as an index key in association with index information. Whereas the document unit index table 127 is organized per document, in the character string unit index table 128 a character string that occurs in a plurality of structured documents has the index information for all of those occurrences associated with a single index key. A search can therefore cover a plurality of structured documents, and structured documents and text elements containing a specific character string are easy to find.

文字列単位索引テーブル１２８においては、頻出文字列は、構造ＩＤ単位でさらに分割されており、文字列に加えて構造ＩＤが索引キーとして登録されている。そして、構造ＩＤに対応付けて各索引情報が登録されている。 In the character string unit index table 128, the frequent character strings are further divided in units of structure IDs, and the structure IDs are registered as index keys in addition to the character strings. Each index information is registered in association with the structure ID.
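One way to picture the character string unit index table is as a nested mapping in which the pair (character string, structure ID) acts as the composite index key. The sketch below is an assumption-laden illustration: `build_string_index`, the tuple layout of an occurrence, and the use of `None` as the structure ID slot for non-frequent strings are not taken from the source.

```python
from collections import defaultdict


def build_string_index(occurrences, frequent):
    """Build {string: {structure ID or None: [(doc ID, element ID, offset), ...]}}.

    occurrences: iterable of (string, structure ID, doc ID, element ID, offset)
    frequent:    set of frequent character strings
    Frequent strings are split by structure ID; all other strings keep a single
    posting list under the key None.
    """
    index = defaultdict(lambda: defaultdict(list))
    for string, structure_id, doc_id, element_id, offset in occurrences:
        key_sid = structure_id if string in frequent else None
        index[string][key_sid].append((doc_id, element_id, offset))
    # Convert to plain dicts so that later lookups cannot create empty entries.
    return {s: dict(by_sid) for s, by_sid in index.items()}
```

Splitting only the frequent strings keeps the key count close to that of an ordinary inverted file while still shortening the longest posting lists.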

さらに、索引登録部１１４は、頻出文字列の構造ＩＤのうち構造ＩＤ頻度ＤＢ１２４において構造ＩＤ頻度が予め設定された閾値よりも小さいものについては、複数の構造ＩＤを１つの構造ＩＤグループとする。そして、この構造ＩＤグループを識別するグループＩＤを構造ＩＤとして文字列単位索引テーブル１２８に登録する。これにより、構造ＩＤの種類が多い場合であっても、細分化による管理コストの増加を抑えることができる。 Further, the index registering unit 114 sets a plurality of structure IDs as one structure ID group for the structure IDs of frequent character strings whose structure ID frequency is lower than a threshold value set in advance in the structure ID frequency DB 124. Then, the group ID for identifying the structure ID group is registered in the character string unit index table 128 as the structure ID. Thereby, even if there are many types of structure IDs, an increase in management cost due to subdivision can be suppressed.

The search condition acquisition unit 130 shown in FIG. 1 acquires a search condition via an input/output device or the like. The search condition includes a search keyword and a search range.

キーワード文字列抽出部１３２は、検索条件取得部１３０が取得した検索条件からキーワード文字列を抽出する。ここで、キーワード文字列とは、キーワードに含まれる文字列である。文字列を抽出する処理は、文字列抽出部１０４がテキスト要素から文字列を抽出する処理と同様である。 The keyword character string extraction unit 132 extracts a keyword character string from the search condition acquired by the search condition acquisition unit 130. Here, the keyword character string is a character string included in the keyword. The process of extracting the character string is the same as the process of extracting the character string from the text element by the character string extraction unit 104.

検索部１３４は、索引ＤＢ１２６において、キーワード文字列抽出部１３２により抽出されたキーワード文字列に一致する索引キーを検索する。構造化文書抽出部１３６は、索引ＤＢ１２６において検索部１３４が特定した索引キーに対応付けられている索引情報により特定される構造化文書を構造化文書ＤＢ１２０から抽出する。構造化文書出力部１３８は、構造化文書抽出部１３６により抽出された構造化文書を検索結果として出力する。 The search unit 134 searches the index DB 126 for an index key that matches the keyword character string extracted by the keyword character string extraction unit 132. The structured document extraction unit 136 extracts a structured document specified by the index information associated with the index key specified by the search unit 134 in the index DB 126 from the structured document DB 120. The structured document output unit 138 outputs the structured document extracted by the structured document extraction unit 136 as a search result.
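The lookup step can be sketched as follows, under several simplifying assumptions: the index is modeled as a nested mapping `{string: {structure ID or None: postings}}`, the search range is given as a set of structure IDs, and a document counts as a hit when it contains every keyword string somewhere (adjacency of the strings and element-level matching are not checked). The function name `search` and its parameters are hypothetical.

```python
def search(index, keyword_strings, structure_ids=None):
    """Return the IDs of documents that contain every keyword string.

    index:           {string: {structure ID or None: [(doc ID, element ID, offset), ...]}}
    keyword_strings: strings extracted from the search keyword
    structure_ids:   optional set of structure IDs restricting the search range
    """
    result = None
    for string in keyword_strings:
        docs = set()
        for structure_id, postings in index.get(string, {}).items():
            # Keys split by structure ID can be skipped without reading their
            # postings when they fall outside the requested search range.
            if (structure_id is not None and structure_ids is not None
                    and structure_id not in structure_ids):
                continue
            docs.update(doc_id for doc_id, _element_id, _offset in postings)
        result = docs if result is None else result & docs
    return result if result is not None else set()
```

The saving for frequent strings comes from the skipped keys: only posting lists whose structure ID lies in the search range are read. If group IDs are stored in place of structure IDs, the requested structure IDs would first have to be mapped to their group IDs, which this sketch does not do.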

図１３は、構造化文書処理装置１０における索引登録処理を示すフローチャートである。まずユーザがマウスやキーボード、ディスプレイモニタなどの入出力装置を利用して構造化文書を入力する。構造化文書処理装置１０の構造化文書取得部１０２は、こうしてユーザから入力された構造化文書を取得する（ステップＳ１００）。 FIG. 13 is a flowchart showing index registration processing in the structured document processing apparatus 10. First, a user inputs a structured document using an input / output device such as a mouse, a keyboard, or a display monitor. The structured document acquisition unit 102 of the structured document processing apparatus 10 acquires the structured document input from the user in this way (step S100).

次に、構造化文書取得部１０２が取得した構造化文書は、構造化文書ＩＤと対応付けて構造化文書ＤＢ１２０に登録される（ステップＳ１０２）。次に、文字列抽出部１０４は、構造化文書ＤＢ１２０に登録されている各構造化文書から文字列を抽出する（ステップＳ１０４）。抽出された文字列は、索引キーとして、索引情報とともに文書単位索引テーブル１２７に登録される。次に、文字列頻度算出部１０６は、文字列抽出部１０４により抽出された各文字列の文字列頻度を算出する（ステップＳ１０６）。各文字列と文字列頻度とが対応付けられて文字列頻度ＤＢ１２２に登録される（ステップＳ１０８）。 Next, the structured document acquired by the structured document acquisition unit 102 is registered in the structured document DB 120 in association with the structured document ID (step S102). Next, the character string extraction unit 104 extracts a character string from each structured document registered in the structured document DB 120 (step S104). The extracted character string is registered in the document unit index table 127 together with index information as an index key. Next, the character string frequency calculation unit 106 calculates the character string frequency of each character string extracted by the character string extraction unit 104 (step S106). Each character string is associated with the character string frequency and registered in the character string frequency DB 122 (step S108).

Next, the frequent character string extraction unit 108 extracts frequent character strings based on the character string frequencies registered in the character string frequency DB 122 (step S110). The structure ID specifying unit 110 then specifies the structure ID of each frequent character string extracted by the frequent character string extraction unit 108 (step S112). The structure ID frequency calculation unit 112 calculates the frequency of appearance of the frequent character string for each structure ID specified by the structure ID specifying unit 110, that is, the structure ID frequency (step S114). The structure ID of the structure to which each frequent character string belongs is registered in the structure ID frequency DB 124 in association with its structure ID frequency (step S116). The index registration unit 114 then registers index keys in association with index information in the index DB 126, based on the information registered in the character string frequency DB 122 and the structure ID frequency DB 124 (step S118). Specifically, each frequent character string and its structure ID are registered as an index key, together with the index information, in the character string unit index table 128. This completes the index registration process.

ここで、索引登録処理についてより具体的に説明する。例えば、図８に示すような文字列と文字列頻度とが登録されたとする。そして、頻出文字列判定の際に利用する閾値が「１００００」と設定されているとする。この場合には、頻出文字列として、「データ」が抽出される。さらに、構造化文書ＤＢ１２０に登録されているすべての構造化文書に含まれる頻出文字列「データ」の構造ＩＤごとの構造ＩＤ頻度が構造ＩＤ頻度ＤＢ１２４に登録される。 Here, the index registration process will be described more specifically. For example, assume that a character string and a character string frequency as shown in FIG. 8 are registered. Then, it is assumed that the threshold used for frequent character string determination is set to “10000”. In this case, “data” is extracted as a frequent character string. Further, the structure ID frequency for each structure ID of the frequent character string “data” included in all structured documents registered in the structured document DB 120 is registered in the structure ID frequency DB 124.

Assume that the structure ID frequency threshold for including a structure ID as an index key in the character string unit index table 128 is set to "500". In this case, structure ID "6" and structure ID "19" shown in FIG. 9, whose structure ID frequencies do not exceed this threshold, are merged into a single structure ID group, and the group ID "G1" identifying that group is assigned. As shown in FIG. 12, the index information for the frequent character string "data" associated with structure ID "6" and structure ID "19" is then registered in the character string unit index table 128 in association with an index key that contains the group ID "G1" and the character string "data".

For a frequent character string, assigning an index key to each structure ID distributes the index information for that string across multiple keys. This reduces the processing cost of searches that specify a structure.

However, if the same (frequent) character string is split by structure ID, the number of index keys may increase significantly, which can degrade data storage efficiency and strain system resources. Therefore, as described above, structure IDs with little corresponding index information are collected into one group. This reduces the search processing cost for frequent character strings while suppressing the increase in the number of index keys.
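The grouping of low-frequency structure IDs described above can be sketched as follows. The function name `assign_index_keys` and the frequency values are hypothetical (they are not the values of FIG. 9); only the threshold "500", the structure IDs "6" and "19", and the group ID "G1" come from the description.

```python
def assign_index_keys(string, structure_freq, group_threshold, group_id="G1"):
    """Map each structure ID to the index key used for `string`."""
    keys = {}
    for sid, freq in structure_freq.items():
        # Structure IDs whose frequency is at or below the threshold share
        # one group key; the others keep a key of their own.
        label = group_id if freq <= group_threshold else sid
        keys[sid] = (label, string)
    return keys

# Hypothetical structure ID frequencies for the frequent string "data".
freqs = {3: 4200, 6: 120, 7: 6100, 19: 80}
keys = assign_index_keys("data", freqs, group_threshold=500)
distinct_keys = set(keys.values())
```

Here four structure IDs yield three index keys, because structure IDs 6 and 19 both map to the key containing "G1". A single group ID is used for simplicity; a real table would need a scheme for issuing more than one.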

As another example, instead of grouping structure IDs whose structure ID frequency is at or below the threshold, structure IDs whose paths in the tree structure of the structured document are similar may be grouped. As yet another example, structure IDs that have the same tag name may be merged regardless of their path from "root", and the sum of the appearance frequencies of the merged structure IDs may be stored as the structure ID frequency. The conditions for grouping are thus not limited to those of the embodiment.
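The tag-name variant mentioned above can be illustrated with a short sketch. It assumes, purely for illustration, that each structure ID is associated with a slash-separated path from "root"; the paths, the frequencies, and the function name `merge_by_tag_name` are hypothetical.

```python
from collections import defaultdict

def merge_by_tag_name(path_by_structure_id, structure_freq):
    """Merge structure IDs that end in the same tag name and sum their frequencies."""
    groups = defaultdict(list)
    for sid, path in path_by_structure_id.items():
        # The last path component is the tag name; the route from root is ignored.
        tag = path.rstrip("/").split("/")[-1]
        groups[tag].append(sid)
    return {
        tag: (sorted(sids), sum(structure_freq.get(sid, 0) for sid in sids))
        for tag, sids in groups.items()
    }

# Hypothetical paths and structure ID frequencies.
paths = {6: "/root/book/title", 19: "/root/article/title", 7: "/root/book/body"}
freqs = {6: 120, 19: 80, 7: 6100}
merged = merge_by_tag_name(paths, freqs)
```

In this toy data, structure IDs 6 and 19 both end in `title`, so they are merged and their frequencies are summed into one structure ID frequency.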

FIG. 14 is a flowchart showing the search process in the structured document processing apparatus 10. First, the user enters a search condition using an input/output device. The search condition acquisition unit 130 of the structured document processing apparatus 10 acquires the search condition entered by the user (step S200). Next, the keyword character string extraction unit 132 extracts keyword character strings from the keywords included in the search condition acquired by the search condition acquisition unit 130 (step S202).

Next, the search unit 134 searches the index DB 126 for character strings that match the keyword character strings extracted by the keyword character string extraction unit 132 (step S204). If the search condition acquired by the search condition acquisition unit 130 specifies a search range, the search is performed within that range. If the search condition specifies a structure ID, the search unit 134 searches for an index key that matches both the structure ID and the keyword character string.

The structured document extraction unit 136 extracts the corresponding structured documents from the structured document DB 120, based on the index information associated with the index keys that the search unit 134 found in the index DB 126 (step S206). As another example, only the corresponding portion of each structured document may be extracted. Next, the structured document output unit 138 outputs the structured documents extracted by the structured document extraction unit 136 as the search result (step S208). This completes the search process.
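A lookup along the lines of steps S204 and S206 might look like the sketch below. The layout of the index, the `group_of` mapping from (keyword, structure ID) to a group ID, and the idea of storing the original structure ID in each posting so that grouped keys can be filtered are assumptions of this example, not details taken from the embodiment.

```python
# Hypothetical index: (structure ID or group ID, string) -> postings,
# where each posting is (doc_id, structure_id).
index = {
    (3, "data"): [("D1", 3)],
    ("G1", "data"): [("D2", 6), ("D3", 19)],
}
# Hypothetical record of which structure IDs were merged into which group.
group_of = {("data", 6): "G1", ("data", 19): "G1"}

def search(index, group_of, keyword, structure_id=None):
    """Return the IDs of documents matching the keyword and optional structure ID."""
    if structure_id is None:
        # No structure specified: gather postings from every key for the keyword.
        postings = [p for (_, s), ps in index.items() if s == keyword for p in ps]
    else:
        # Resolve the structure ID to its group ID when it was grouped.
        label = group_of.get((keyword, structure_id), structure_id)
        # A grouped key mixes several structure IDs, so filter the postings.
        postings = [
            p for p in index.get((label, keyword), []) if p[1] == structure_id
        ]
    return sorted({doc_id for doc_id, _ in postings})

hits_for_6 = search(index, group_of, "data", structure_id=6)
hits_any = search(index, group_of, "data")
```

A query that specifies a structure ID touches a single index key, which is the cost reduction the description attributes to splitting frequent strings by structure ID.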

FIG. 15 is a diagram showing the hardware configuration of the structured document processing apparatus 10 according to the embodiment. As its hardware configuration, the structured document processing apparatus 10 includes a ROM 52 that stores, among other things, the structured document processing program for executing the index registration process and the search process; a CPU 51 that controls each unit of the structured document processing apparatus 10 in accordance with the programs in the ROM 52; a RAM 53 that stores various data needed to control the structured document processing apparatus 10; a communication I/F 57 that connects to a network for communication; and a bus 62 that connects these units.

The structured document processing program of the structured document processing apparatus 10 described above may be provided as a file in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, a floppy (R) disk (FD), or a DVD.

In this case, the structured document processing program is read from the recording medium and executed in the structured document processing apparatus 10, whereby it is loaded onto the main storage device, and the units described in the software configuration above are generated on the main storage device.

The structured document processing program of the present embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded over the network.

The present invention has been described above using the embodiment, but various changes or improvements can be made to the embodiment.

As a first modification, the process of merging multiple structure IDs into one group may be omitted. In this case, the structured document processing apparatus 10 need not include the structure ID frequency calculation unit 112, and the processing that calculates structure ID frequencies and merges structure IDs into groups based on them is omitted.

As a second modification, the structured document processing apparatus 10 need not hold the document unit index table 127 shown in FIG. 11.

FIG. 1 is a block diagram showing the overall configuration of the structured document processing apparatus 10 according to the embodiment.
FIG. 2 is a diagram for explaining a structured document.
FIG. 3 is a diagram conceptually showing the structure of the structured document shown in FIG. 2.
FIG. 4 is a diagram for explaining structure IDs in a structured document.
FIG. 5 is a diagram schematically showing the data configuration of the structured document DB 120.
FIG. 6 is a diagram for explaining the character strings extracted by the character string extraction unit 104.
FIG. 7 is a diagram for explaining the character string extraction process.
FIG. 8 is a diagram schematically showing the data configuration of the character string frequency DB 122.
FIG. 9 is a diagram schematically showing the data configuration of the structure ID frequency DB 124.
FIG. 10 is a diagram for explaining offsets.
FIG. 11 is a diagram schematically showing the data configuration of the document unit index table 127, held by the index DB 126 for each document ID.
FIG. 12 is a diagram schematically showing the data configuration of the character string unit index table 128, held by the index DB 126 for each character string.
FIG. 13 is a flowchart showing the index registration process in the structured document processing apparatus 10.
FIG. 14 is a flowchart showing the search process in the structured document processing apparatus 10.
FIG. 15 is a diagram showing the hardware configuration of the structured document processing apparatus 10 according to the embodiment.

Explanation of symbols

10 Structured document processing apparatus
51 CPU
52 ROM
53 RAM
57 Communication I/F
62 Bus
102 Structured document acquisition unit
104 Character string extraction unit
106 Character string frequency calculation unit
108 Frequent character string extraction unit
110 Structure ID specifying unit
112 Structure ID frequency calculation unit
114 Index registration unit
120 Structured document DB
122 Character string frequency DB
124 Structure ID frequency DB
126 Index DB
127 Document unit index table
128 Character string unit index table
130 Search condition acquisition unit
132 Keyword character string extraction unit
134 Search unit
136 Structured document extraction unit
138 Structured document output unit

Claims

A structured document processing apparatus comprising:
structured document holding means for holding a plurality of structured documents expressed in a tree structure;
character string frequency holding means for holding, in association with each other, a character string included in the structured documents held by the structured document holding means and a character string frequency, which is the appearance frequency of the character string in the plurality of structured documents;
index information holding means for holding, for each frequent character string whose character string frequency is equal to or higher than a preset threshold among the character strings held in the character string frequency holding means, index information for identifying the character string specified by an index key, in association with the index key, the index key being formed from the frequent character string and a structure ID indicating the position in the tree structure of the structured document at which the frequent character string appears;
frequent character string extracting means for extracting, from the character strings held by the character string frequency holding means, a frequent character string whose character string frequency is equal to or higher than a predetermined threshold; and
structure ID specifying means for specifying the structure ID of the frequent character string extracted by the frequent character string extracting means,
wherein the structure ID specifying means assigns, to a plurality of structure IDs of the frequent character string that match a predetermined condition, group identification information identifying a group including the plurality of structure IDs, and
the index information holding means holds the structure ID specified by the structure ID specifying means as the index key and, for the frequent character string included in the group, holds the group identification information as the structure ID of the index key.

The structured document processing apparatus according to claim 1, further comprising structure ID holding means for holding, in association with each other, the structure ID of a frequent character string whose character string frequency is equal to or higher than a predetermined threshold among the character strings held by the character string frequency holding means, and a structure ID frequency, which is the appearance frequency of the character string at the position indicated by the structure ID,
wherein the structure ID specifying means assigns the group identification information to a plurality of structure IDs whose associated structure ID frequencies are equal to or less than a predetermined threshold among the frequent character strings held by the structure ID holding means.

A structured document processing method in which a computer executes:
a structured document holding step in which structured document holding means holds a plurality of structured documents expressed in a tree structure;
a character string frequency holding step in which character string frequency holding means holds, in association with each other, a character string included in the structured documents held by the structured document holding means and a character string frequency, which is the appearance frequency of the character string in the plurality of structured documents;
an index creation step in which index information holding means registers, for each frequent character string whose character string frequency is equal to or higher than a preset threshold among the character strings held in the character string frequency holding means, index information for identifying the character string specified by an index key, in association with the index key, the index key being formed from the frequent character string and a structure ID indicating the position in the tree structure of the structured document at which the frequent character string appears;
a frequent character string extracting step in which frequent character string extracting means extracts, from the character strings held by the character string frequency holding means, a frequent character string whose character string frequency is equal to or higher than a predetermined threshold; and
a structure ID specifying step in which structure ID specifying means specifies the structure ID of the frequent character string extracted by the frequent character string extracting means,
wherein the structure ID specifying means assigns, to a plurality of structure IDs of the frequent character string that match a predetermined condition, group identification information identifying a group including the plurality of structure IDs, and
the index information holding means holds the structure ID specified by the structure ID specifying means as the index key and, for the frequent character string included in the group, holds the group identification information as the structure ID of the index key.

A structured document processing program that causes a computer to execute:
a structured document holding step in which structured document holding means holds a plurality of structured documents expressed in a tree structure;
a character string frequency holding step in which character string frequency holding means holds, in association with each other, a character string included in the structured documents held by the structured document holding means and a character string frequency, which is the appearance frequency of the character string in the plurality of structured documents;
an index creation step in which index information holding means registers, for each frequent character string whose character string frequency is equal to or higher than a preset threshold among the character strings held in the character string frequency holding means, index information for identifying the character string specified by an index key, in association with the index key, the index key being formed from the frequent character string and a structure ID indicating the position in the tree structure of the structured document at which the frequent character string appears;
a frequent character string extracting step in which frequent character string extracting means extracts, from the character strings held by the character string frequency holding means, a frequent character string whose character string frequency is equal to or higher than a predetermined threshold; and
a structure ID specifying step in which structure ID specifying means specifies the structure ID of the frequent character string extracted by the frequent character string extracting means,
wherein the structure ID specifying means assigns, to a plurality of structure IDs of the frequent character string that match a predetermined condition, group identification information identifying a group including the plurality of structure IDs, and
the index information holding means holds the structure ID specified by the structure ID specifying means as the index key and, for the frequent character string included in the group, holds the group identification information as the structure ID of the index key.